Table of contents

  1. Importing Packages
  2. Exploratory Data Analysis
  3. Cases
    3.1. Overview of the cases in Korea
    3.2. Check the relationship between group and solo cases
    3.3. Source of infected cases sort by source and province
    3.4. Risk Level of each province
  4. Patient Info
    4.1. Numbers Of Cases Per Day
    4.2. Average Survivor Treatment Day (Recovery Speed), By Age and Gender
    4.3. Average Survivor Treatment Day (Recovery Speed) By City & Age
    4.4. Average Deceased Treatment Day (Fatality Speed) By Gender and Age
    4.5. Average Deceased Treatment Day (Fatality Speed) By City and Age
    4.6. Number of Infection, Survivor, Deceased Per City by percentage
    4.7. Number of Infection, Survivor, Deceased Sort By Gender
    4.8. Network Diagram
  5. Policy
  6. Time
    6.1. Test & Negative cases daily average increase amount by day
    6.2. Test & Negative cases daily average increase amount by month
    6.3. Confirmed, released & deceased cases daily average increase amount by day
    6.4. Confirmed, released & deceased cases daily average increase amount by month
  7. Time Age
    7.1. Deceased vs confirmed cases across different age group over time
  8. Time Gender
    8.1. Deceased vs confirmed cases across different gender over time
    8.2. Confirmed cases daily average increase amount by month (Sort by gender)
    8.3. Confirmed cases daily average increase amount by day (Sort by gender)
    8.4. Deceased cases daily average increase amount by month (Sort by gender)
    8.5. Deceased cases daily average increase amount by day (Sort by gender)
  9. Time Province
    9.1. Deceased vs confirmed cases across different province over time (With Daegu)
    9.2. Deceased vs confirmed cases across different province over time (Without Daegu)
  10. In depth Analysis Of Daegu Cases using Technical Analysis
    10.1 Technical Analysis
    10.2 Stats By Month
    10.3 Stats By Day
  11. Time Series World Map Visualisation of COVID-19 cases in korea
    11.1 With Animation
    11.2 Without animation
  12. Weather
  13. Region
    13.1 Busan Avg Temp (Mean) vs Most Wind Direction (Mean) Analysis for days with wind direction greater than 100
    13.2 Does other region has the same relationship between avg_temp and most_wind direction?
  14. Web Scalping: Comparsion across the world
    14.1 Getting the data from the table
    14.2 Covert to data
    14.3 Worldwide Total Cases Chart
    14.4 Worldwide Total Death Chart
    14.5 Scatterplot between total cases and total death
  15. Summarise Key findings from Exploratory Data Analysis
    15.1 Overall Cases in Korea
    15.2 Patient Cases
    15.3 Policy
    15.4 Cases Over Time Analysis
    15.5 Cases Over Age & Time Analysis
    15.6 Cases Over Gender & Time Analysis
    15.7 Cases Over Province & Time Analysis
    15.8 Weather
    15.9 Number of nursing home vs Population Ratio
  16. Preparing the data for modelling
    16.1 Get the necessary columns
    16.2 Calculate number of days of treatment before dropped date column
    16.3 Convert state, gender and gender columns to categorical data
    16.4 Get dummy for province
  17. Correlation Matrix
  18. Building Machine Learning Models Part 1
    18.1 Stochastic Gradient Descent (SGD)
    18.2 Random Forest
    18.3 Logistic Regression
    18.4 Gaussian Naive Bayes
    18.5 K Nearest Neighbor
    18.6 Perceptron
    18.7 Linear Support Vector Machine
    18.8 Decision Tree
    18.9 Getting the best model
    18.10 Decision Tree Diagram
    18.11 What is the most important feature ?
    18.12 Confusion Matrix with Precision & Recall & F-Score
  19. Building Machine Learning Models Part 2
    19.1 Random Forest
    19.2 Decision Tree
    19.3 Getting the best model
    19.4 Decision Tree Diagram
    19.5 Importance
    19.6 Confusion Matrix with Precision & Recall & F-Score
  20. Summary between the old and new model
  21. Step to take to control the COVID-19 situation even better
    21.1 Abraham Wald and the Missing Bullet Holes
    21.2 Survivorship Bias
    21.3 Improving the situation
  22. Bonus: Refine Dataset for Machine Learning
    22.1 Preparing the data for modelling
    22.2 Building Machine Learning Models Part 3
    22.3 Importances
    22.4 Confusion Matrix with Precision & Recall & F-Score
    22.5 Precision Recall Curve
  23. Final Summary
  24. References
  25. Appendix
  26. Contribution Statements
In [1]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

Importing Packages

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt
%matplotlib inline
import networkx as nx

import plotly
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
import plotly.graph_objs as go
import plotly.figure_factory as ff
import plotly.express as px
px.set_mapbox_access_token(open("./mapbox_token").read())

from bs4 import BeautifulSoup
from warnings import filterwarnings
filterwarnings('ignore')
from datetime import datetime 
import requests

# Algorithms
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB

# Matrix
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import precision_recall_curve

# Diagram
from sklearn.tree import export_graphviz
import pydotplus
from sklearn.externals.six import StringIO  
from IPython.display import Image

Exploratory Data Analysis

Case

Overview of the cases in Korea

In [3]:
caseData = pd.read_csv('covid/case.csv')
caseDataForMap = caseData.copy()
caseDataForMap = caseDataForMap[caseDataForMap[['latitude', 'longitude']] != '-']

cols = ['latitude', 'longitude']
caseData[cols] = caseData[cols].apply(pd.to_numeric, errors='coerce', axis=1)
display(caseData.head())

fig = px.scatter_mapbox(
    caseData[caseData.latitude != '-'], 
    text="<br>City: " + caseData["city"] +" <br>Province: " + caseData["province"],
    lat="latitude", 
    lon="longitude",     
    color="confirmed", 
    size="confirmed",
    color_continuous_scale=px.colors.sequential.Burg,
    size_max=100, 
    mapbox_style='dark',
    zoom=6,
    title="Overview of the cases in Korea")
fig.show()
case_id province city group infection_case confirmed latitude longitude
0 1000001 Seoul Yongsan-gu True Itaewon Clubs 139 37.538621 126.992652
1 1000002 Seoul Gwanak-gu True Richway 119 37.482080 126.901384
2 1000003 Seoul Guro-gu True Guro-gu Call Center 95 37.508163 126.884387
3 1000004 Seoul Yangcheon-gu True Yangcheon Table Tennis Club 43 37.546061 126.874209
4 1000005 Seoul Dobong-gu True Day Care Center 43 37.679422 127.044374

Check the relationship between group and solo cases

In [4]:
caseData = pd.read_csv('covid/case.csv')
caseDataForMap = caseData.copy()
caseDataSorted = caseDataForMap.sort_values(by=['confirmed'], ascending=False)
display(caseDataSorted.head())

sortedValues = caseDataForMap.groupby(['province','group']).sum().sort_values(by=['confirmed'], ascending=False).reset_index()
sortedValues = sortedValues[['province','confirmed','group']]
display(sortedValues.head())

fig = px.bar(sortedValues, x='confirmed', y='province', color='group', barmode='group')
fig.update_layout(hovermode='y')
fig.show()
case_id province city group infection_case confirmed latitude longitude
48 1200001 Daegu Nam-gu True Shincheonji Church 4511 35.84008 128.5667
56 1200009 Daegu - False contact with patient 917 - -
57 1200010 Daegu - False etc 747 - -
145 6000001 Gyeongsangbuk-do from other city True Shincheonji Church 566 - -
109 2000020 Gyeonggi-do - False overseas inflow 305 - -
province confirmed group
0 Daegu 4975 True
1 Daegu 1705 False
2 Gyeongsangbuk-do 979 True
3 Seoul 720 True
4 Seoul 560 False

Source of infected cases sort by source and province

In [5]:
locations = caseData.pivot("province", "infection_case", "confirmed")
display(locations.head())

f, ax = plt.subplots(figsize=(20, 5))
sns.heatmap(locations, cmap="gnuplot", annot=False, fmt="d", linewidths=.5, ax=ax).set_title('Numbers of cases sort by source and province')
infection_case Anyang Gunpo Pastors Group Biblical Language study meeting Bonghwa Pureun Nursing Home Bundang Jesaeng Hospital Changnyeong Coin Karaoke Cheongdo Daenam Hospital Coupang Logistics Center Daejeon door-to-door sales Daesil Convalescent Hospital Daezayeon Korea ... Yangcheon Table Tennis Club Yechun-gun Yeonana News Class Yeongdeungpo Learning Institute Yongin Brothers contact with patient etc gym facility in Cheonan gym facility in Sejong overseas inflow
province
Busan NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 19.0 30.0 NaN NaN 36.0
Chungcheongbuk-do NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 8.0 11.0 NaN NaN 13.0
Chungcheongnam-do NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 2.0 12.0 103.0 NaN 16.0
Daegu NaN NaN NaN NaN NaN 2.0 NaN NaN 101.0 NaN ... NaN NaN NaN NaN NaN 917.0 747.0 NaN NaN 41.0
Daejeon NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 15.0 15.0 NaN NaN 15.0

5 rows × 81 columns

Out[5]:
Text(0.5, 1, 'Numbers of cases sort by source and province')
In [6]:
# Used log, because daegu has high amount of infection which highly skewed the data
caseData['logConfirmed'] = np.log(caseData['confirmed'])
locations = caseData.pivot("province", "infection_case", "logConfirmed")
display(locations.head())

f, ax = plt.subplots(figsize=(20, 5))
sns.heatmap(locations, cmap="gnuplot", annot=False, fmt="d", linewidths=.5, ax=ax).set_title('Numbers of cases sort by source and province')
infection_case Anyang Gunpo Pastors Group Biblical Language study meeting Bonghwa Pureun Nursing Home Bundang Jesaeng Hospital Changnyeong Coin Karaoke Cheongdo Daenam Hospital Coupang Logistics Center Daejeon door-to-door sales Daesil Convalescent Hospital Daezayeon Korea ... Yangcheon Table Tennis Club Yechun-gun Yeonana News Class Yeongdeungpo Learning Institute Yongin Brothers contact with patient etc gym facility in Cheonan gym facility in Sejong overseas inflow
province
Busan NaN NaN NaN NaN NaN 0.000000 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 2.944439 3.401197 NaN NaN 3.583519
Chungcheongbuk-do NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 2.079442 2.397895 NaN NaN 2.564949
Chungcheongnam-do NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 0.693147 2.484907 4.634729 NaN 2.772589
Daegu NaN NaN NaN NaN NaN 0.693147 NaN NaN 4.615121 NaN ... NaN NaN NaN NaN NaN 6.821107 6.616065 NaN NaN 3.713572
Daejeon NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 2.708050 2.708050 NaN NaN 2.708050

5 rows × 81 columns

Out[6]:
Text(0.5, 1, 'Numbers of cases sort by source and province')

Risk Level of each province

In [7]:
caseData = pd.read_csv('covid/case.csv')
caseDataForMap = caseData.copy()
caseDataForMap['logConfirmed'] = np.log(caseDataForMap['confirmed'])
display(caseDataForMap.head())

fig = px.box(caseDataForMap, x="logConfirmed", y="province")
fig.update_traces(quartilemethod="exclusive") # or "inclusive", or "linear" by default
fig.update_layout(hovermode='y')
fig.show()
case_id province city group infection_case confirmed latitude longitude logConfirmed
0 1000001 Seoul Yongsan-gu True Itaewon Clubs 139 37.538621 126.992652 4.934474
1 1000002 Seoul Gwanak-gu True Richway 119 37.48208 126.901384 4.779123
2 1000003 Seoul Guro-gu True Guro-gu Call Center 95 37.508163 126.884387 4.553877
3 1000004 Seoul Yangcheon-gu True Yangcheon Table Tennis Club 43 37.546061 126.874209 3.761200
4 1000005 Seoul Dobong-gu True Day Care Center 43 37.679422 127.044374 3.761200
In [8]:
caseDataForMap = caseData.copy()
sortedValues = caseDataForMap.groupby(['province']).sum().sort_values(by=['confirmed'], ascending=False).reset_index()
sortedValues['logConfirmed'] = np.log(sortedValues['confirmed'])
sortedValues = sortedValues[['province','confirmed','logConfirmed']]
display(sortedValues.head())
province confirmed logConfirmed
0 Daegu 6680 8.806873
1 Gyeongsangbuk-do 1324 7.188413
2 Seoul 1280 7.154615
3 Gyeonggi-do 1000 6.907755
4 Incheon 202 5.308268
In [9]:
print('Mean: ', sortedValues.mean().values)
print('Standard Derivation: ', sortedValues.std().values)
Mean:  [670.29411765   4.96126521]
Standard Derivation:  [1610.8280776     1.65884891]

This amount of COVID-19 cases has a mean of 670 and a standard deviation of 1610.

Hence we sort the provinces into 4 different types of level:

  • Very Risk = 4 (Above 2000 cases)
  • High Risk = 3 (1000 - 2000 cases)
  • Medium Risk = 2 (100 - 999 cases)
  • Low Risk = 1 ( Below 99 cases)
In [10]:
def calculate_risk_level(sortedValues) :
    if sortedValues['confirmed'] >= 2000:
        return 4
    elif sortedValues['confirmed'] >= 1000 and sortedValues['confirmed'] < 2000:
        return 3
    elif sortedValues['confirmed'] >= 100 and sortedValues['confirmed'] < 900:
        return 2
    else:
        return 1
    
sortedValues['Risk Level'] = sortedValues.apply(calculate_risk_level,axis=1);
sortedValues = sortedValues[['province','Risk Level']]
display(sortedValues.head())
province Risk Level
0 Daegu 4
1 Gyeongsangbuk-do 3
2 Seoul 3
3 Gyeonggi-do 3
4 Incheon 2

Patient Info

Numbers Of Cases Per Day

In [11]:
patientData = pd.read_csv('covid/patientinfo.csv')
display(patientData.head())

groupPatientData = patientData.groupby('confirmed_date').size().reset_index()
groupPatientData.columns = ['confirmed_date', 'count']

fig = px.line(groupPatientData, x="confirmed_date", y="count", title='Numbers Of Covid Cases Per Day')
fig.update_layout(hovermode='x')
fig.show()
patient_id sex age country province city infection_case infected_by contact_number symptom_onset_date confirmed_date released_date deceased_date state
0 1000000001 male 50s Korea Seoul Gangseo-gu overseas inflow NaN 75 2020-01-22 2020-01-23 2020-02-05 NaN released
1 1000000002 male 30s Korea Seoul Jungnang-gu overseas inflow NaN 31 NaN 2020-01-30 2020-03-02 NaN released
2 1000000003 male 50s Korea Seoul Jongno-gu contact with patient 2002000001 17 NaN 2020-01-30 2020-02-19 NaN released
3 1000000004 male 20s Korea Seoul Mapo-gu overseas inflow NaN 9 2020-01-26 2020-01-30 2020-02-15 NaN released
4 1000000005 female 20s Korea Seoul Seongbuk-gu contact with patient 1000000002 2 NaN 2020-01-31 2020-02-24 NaN released

Average Survivor Treatment Day (Recovery Speed), By Age and Gender

In [12]:
confinedDaysAnalysis = patientData.copy()
confinedDaysAnalysis = confinedDaysAnalysis[confinedDaysAnalysis['age'] != '100s'] #Just a single data which is highly skewed
confinedDaysAnalysis = confinedDaysAnalysis[confinedDaysAnalysis['released_date'].notnull()]

cols = ['released_date', 'confirmed_date']
confinedDaysAnalysis[cols] = confinedDaysAnalysis[cols].apply(pd.to_datetime, errors='coerce', axis=1)
confinedDaysAnalysis['Total Treatment Days'] = (confinedDaysAnalysis['released_date'] - confinedDaysAnalysis['confirmed_date']).dt.days

groupedData = confinedDaysAnalysis.groupby(['sex', 'age'])['Total Treatment Days'].mean().unstack().stack().reset_index()
groupedData.columns = ["sex", "age", "Average Treatment Days"]

dataForHeatmap = groupedData.pivot("sex", "age", "Average Treatment Days")
display(dataForHeatmap.head())

f, ax = plt.subplots(figsize=(15, 5))
sns.heatmap(dataForHeatmap, cmap="RdPu", annot=False, fmt="d", linewidths=.5, ax=ax).set_title('Average Survivor Treatment Day, Sort by Age & Gender')
age 0s 10s 20s 30s 40s 50s 60s 70s 80s 90s
sex
female 28.375000 19.80 23.659091 23.415094 22.025806 23.310345 26.543689 31.344828 33.738095 28.461538
male 21.727273 21.45 23.275132 22.971429 26.581395 24.666667 27.257143 36.200000 37.055556 33.000000
Out[12]:
Text(0.5, 1, 'Average Survivor Treatment Day, Sort by Age & Gender')
In [13]:
confinedDaysAnalysis['age'] = confinedDaysAnalysis['age'].dropna()
age = {'0s':0, '10s':1, '20s':2, '30s':3, '40s':4, '50s':5, '60s':6, '70s':7, '80s':8, '90s':9}
confinedDaysAnalysis['age'] = confinedDaysAnalysis['age'].map(age)
confinedDaysAnalysis = confinedDaysAnalysis[confinedDaysAnalysis['age'].notna()]

fig = px.scatter(confinedDaysAnalysis, x="age", y="Total Treatment Days", color="sex", trendline="ols", title="Does age group affect total treatment days?")
fig.update_layout(hovermode='y')
fig.show()

Average Survivor Treatment Day (Recovery Speed) By City & Age

In [14]:
groupedData = confinedDaysAnalysis.groupby(['province', 'age'])['Total Treatment Days'].mean().unstack().stack().reset_index()
groupedData.columns = ["province", "age", "Average Treatment Days"]
groupedData['Average Treatment Days'] = np.ceil(groupedData['Average Treatment Days'].apply(pd.to_numeric, errors='coerce'))

dataForHeatmap = groupedData.pivot("province", "age", "Average Treatment Days")
display(dataForHeatmap.head())

f, ax = plt.subplots(figsize=(20, 5))
sns.heatmap(dataForHeatmap, cmap="RdPu", annot=False, fmt="d", linewidths=.5, ax=ax).set_title('Average Survivor Treatment Day, Sort by City & Age')
age 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0
province
Chungcheongbuk-do 49.0 20.0 24.0 23.0 23.0 25.0 23.0 26.0 26.0 20.0
Chungcheongnam-do 17.0 23.0 23.0 26.0 24.0 27.0 27.0 43.0 9.0 NaN
Daegu NaN NaN NaN NaN NaN 9.0 7.0 12.0 NaN NaN
Daejeon NaN 15.0 22.0 23.0 18.0 26.0 40.0 24.0 17.0 NaN
Gangwon-do NaN NaN 12.0 24.0 27.0 19.0 25.0 NaN NaN NaN
Out[14]:
Text(0.5, 1, 'Average Survivor Treatment Day, Sort by City & Age')

Average Deceased Treatment Day (Fatality Speed) By Gender and Age

In [15]:
confinedDaysAnalysis = patientData.copy()
confinedDaysAnalysis = confinedDaysAnalysis[confinedDaysAnalysis['age'] != '100s'] #Just a single data which is highly skewed
confinedDaysAnalysis = confinedDaysAnalysis[confinedDaysAnalysis['deceased_date'].notnull()]

cols = ['deceased_date', 'confirmed_date']
confinedDaysAnalysis[cols] = confinedDaysAnalysis[cols].apply(pd.to_datetime, errors='coerce', axis=1)
confinedDaysAnalysis['Total Treatment Days'] = (confinedDaysAnalysis['deceased_date'] - confinedDaysAnalysis['confirmed_date']).dt.days
display(confinedDaysAnalysis.head())

groupedData = confinedDaysAnalysis.groupby(['sex', 'age'])['Total Treatment Days'].mean().unstack().stack().reset_index()
groupedData.columns = ["sex", "age", "Average Treatment Days"]

dataForHeatmap = groupedData.pivot("sex", "age", "Average Treatment Days")
display(dataForHeatmap.head())

f, ax = plt.subplots(figsize=(15, 5))
sns.heatmap(dataForHeatmap, cmap="RdPu", annot=False, fmt="d", linewidths=.5, ax=ax).set_title('Average Deceased Treatment Day, Sort by Age & Gender')
patient_id sex age country province city infection_case infected_by contact_number symptom_onset_date confirmed_date released_date deceased_date state Total Treatment Days
1468 1200000038 female 50s Korea Daegu Nam-gu NaN NaN NaN NaN 2020-02-18 NaN 2020-02-23 deceased 5
1507 1200000114 male 70s Korea Daegu NaN Shincheonji Church NaN NaN NaN 2020-02-21 NaN 2020-02-26 deceased 5
1508 1200000620 male 70s Korea Daegu NaN NaN NaN NaN NaN 2020-02-24 NaN 2020-03-02 deceased 7
1509 1200000901 female 80s Korea Daegu NaN NaN NaN NaN NaN 2020-02-25 NaN 2020-03-04 deceased 8
1510 1200001064 female 70s Korea Daegu NaN NaN NaN NaN NaN 2020-02-26 NaN 2020-03-01 deceased 4
age 30s 50s 60s 70s 80s 90s
sex
female NaN 1.333333 8.0 27.0 11.615385 10.0
male 0.0 8.000000 10.9 5.5 10.666667 10.5
Out[15]:
Text(0.5, 1, 'Average Deceased Treatment Day, Sort by Age & Gender')

Average Deceased Treatment Day (Fatality Speed) By City and Age

In [16]:
groupedData = confinedDaysAnalysis.groupby(['province', 'age'])['Total Treatment Days'].mean().unstack().stack().reset_index()
groupedData.columns = ["province", "age", "Average Treatment Days"]
groupedData['Average Treatment Days'] = np.ceil(groupedData['Average Treatment Days'].apply(pd.to_numeric, errors='coerce'))

dataForHeatmap = groupedData.pivot("province", "age", "Average Treatment Days")
display(dataForHeatmap.head())

f, ax = plt.subplots(figsize=(20, 5))
sns.heatmap(dataForHeatmap, cmap="RdPu", annot=False, fmt="d", linewidths=.5, ax=ax).set_title('Average Deceased Treatment Day, Sort by City & Age')
age 30s 50s 60s 70s 80s 90s
province
Daegu NaN 3.0 6.0 4.0 5.0 6.0
Daejeon NaN NaN NaN 52.0 NaN NaN
Gangwon-do NaN NaN NaN 15.0 24.0 NaN
Gyeonggi-do 0.0 NaN NaN NaN NaN NaN
Gyeongsangbuk-do NaN 7.0 11.0 17.0 13.0 12.0
Out[16]:
Text(0.5, 1, 'Average Deceased Treatment Day, Sort by City & Age')

Number of Infection, Survivor, Deceased Per City by percentage

In [17]:
survivorCountAnalysis = patientData.copy()

survivorCountAnalysis['survive'] = survivorCountAnalysis['released_date'].notnull()
survivorCountAnalysis['deceased'] = survivorCountAnalysis['deceased_date'].notnull()
survivorCountAnalysis['under treatment'] = survivorCountAnalysis['deceased_date'].isnull() & survivorCountAnalysis['released_date'].isnull()

provinceStats = survivorCountAnalysis.groupby(['province']).sum()
provinceStatsClean = provinceStats[['survive', 'deceased', 'under treatment']]

provinceStatsClean['survive %'] = np.round(provinceStatsClean['survive'] / (provinceStatsClean['survive'] + provinceStatsClean['deceased'] + provinceStatsClean['under treatment']) * 100, 2)
provinceStatsClean['deceased %'] = np.round(provinceStatsClean['deceased'] / (provinceStatsClean['survive'] + provinceStatsClean['deceased'] + provinceStatsClean['under treatment']) * 100, 2)
provinceStatsClean['under treatment %'] = np.round(provinceStatsClean['under treatment'] / (provinceStatsClean['survive'] + provinceStatsClean['deceased'] + provinceStatsClean['under treatment']) * 100, 2)

provinceStatsAbsolute = provinceStatsClean[['survive', 'deceased', 'under treatment']]
provinceStatsPercentage = provinceStatsClean[['survive %', 'deceased %', 'under treatment %']]

display(provinceStatsAbsolute.head())
display(provinceStatsPercentage.head())
survive deceased under treatment
province
Busan 0.0 0.0 151.0
Chungcheongbuk-do 50.0 0.0 6.0
Chungcheongnam-do 150.0 0.0 18.0
Daegu 4.0 20.0 113.0
Daejeon 44.0 1.0 74.0
survive % deceased % under treatment %
province
Busan 0.00 0.00 100.00
Chungcheongbuk-do 89.29 0.00 10.71
Chungcheongnam-do 89.29 0.00 10.71
Daegu 2.92 14.60 82.48
Daejeon 36.97 0.84 62.18
In [18]:
newTable = provinceStatsPercentage.reset_index()
display(newTable.head())

newTable = pd.melt(newTable, id_vars="province", var_name="stats", value_name="rate")
display(newTable.head())

fig = px.bar(newTable, x='rate', y='province', color='stats', barmode='group')
fig.update_layout(hovermode='y')
fig.show()
province survive % deceased % under treatment %
0 Busan 0.00 0.00 100.00
1 Chungcheongbuk-do 89.29 0.00 10.71
2 Chungcheongnam-do 89.29 0.00 10.71
3 Daegu 2.92 14.60 82.48
4 Daejeon 36.97 0.84 62.18
province stats rate
0 Busan survive % 0.00
1 Chungcheongbuk-do survive % 89.29
2 Chungcheongnam-do survive % 89.29
3 Daegu survive % 2.92
4 Daejeon survive % 36.97

Number of Infection, Survivor, Deceased Sort By Gender

In [19]:
survivorCountAnalysis = patientData.copy()

survivorCountAnalysis['survive'] = survivorCountAnalysis['released_date'].notnull()
survivorCountAnalysis['deceased'] = survivorCountAnalysis['deceased_date'].notnull()
survivorCountAnalysis['under treatment'] = survivorCountAnalysis['deceased_date'].isnull() & survivorCountAnalysis['released_date'].isnull()

provinceStats = survivorCountAnalysis.groupby(['sex']).sum()
provinceStatsClean = provinceStats[['survive', 'deceased', 'under treatment']]

provinceStatsClean['survive %'] = np.round(provinceStatsClean['survive'] / (provinceStatsClean['survive'] + provinceStatsClean['deceased'] + provinceStatsClean['under treatment']) * 100, 2)
provinceStatsClean['deceased %'] = np.round(provinceStatsClean['deceased'] / (provinceStatsClean['survive'] + provinceStatsClean['deceased'] + provinceStatsClean['under treatment']) * 100, 2)
provinceStatsClean['under treatment %'] = np.round(provinceStatsClean['under treatment'] / (provinceStatsClean['survive'] + provinceStatsClean['deceased'] + provinceStatsClean['under treatment']) * 100, 2)

provinceStatsAbsolute = provinceStatsClean[['survive', 'deceased', 'under treatment']]
provinceStatsPercentage = provinceStatsClean[['survive %', 'deceased %', 'under treatment %']]

display(provinceStatsAbsolute)
display(provinceStatsPercentage)
survive deceased under treatment
sex
female 909.0 26.0 1285.0
male 677.0 40.0 1108.0
survive % deceased % under treatment %
sex
female 40.95 1.17 57.88
male 37.10 2.19 60.71
In [20]:
newTable = provinceStatsPercentage.reset_index()
newTable = pd.melt(newTable, id_vars="sex", var_name="stats", value_name="rate")
display(newTable.head())

fig = px.bar(newTable, x='rate', y='sex', color='stats', barmode='group')
fig.update_layout(hovermode='y')
fig.show()
sex stats rate
0 female survive % 40.95
1 male survive % 37.10
2 female deceased % 1.17
3 male deceased % 2.19
4 female under treatment % 57.88

Network Diagram

In [21]:
networkData = patientData.copy()
networkData = networkData[networkData['infected_by'].notnull()]
networkData = networkData[['patient_id','sex','age','province','city','infection_case','infected_by','state']]
display(networkData.head())
patient_id sex age province city infection_case infected_by state
2 1000000003 male 50s Seoul Jongno-gu contact with patient 2002000001 released
4 1000000005 female 20s Seoul Seongbuk-gu contact with patient 1000000002 released
5 1000000006 female 50s Seoul Jongno-gu contact with patient 1000000003 released
6 1000000007 male 20s Seoul Jongno-gu contact with patient 1000000003 released
9 1000000010 female 60s Seoul Seongbuk-gu contact with patient 1000000003 released
In [22]:
A = list(networkData["infected_by"].unique())
B = list(networkData["patient_id"].unique())
node_list = set(A+B)

# Create Graph
G = nx.Graph()

for i in node_list:
    G.add_node(i)
# G.nodes()

for i,j in networkData.iterrows():
    G.add_edges_from([(j["infected_by"],j["patient_id"])])
    
pos = nx.spring_layout(G, k=0.5, iterations=50)

for n, p in pos.items():
    G.nodes[n]['pos'] = p

edge_trace = go.Scatter(
    x=[],
    y=[],
    line=dict(width=0.5,color='#888'),
    hoverinfo='none',
    mode='lines')

for edge in G.edges():
    x0, y0 = G.nodes[edge[0]]['pos']
    x1, y1 = G.nodes[edge[1]]['pos']
    edge_trace['x'] += tuple([x0, x1, None])
    edge_trace['y'] += tuple([y0, y1, None])

node_trace = go.Scatter(
    x=[],
    y=[],
    text=[],
    mode='markers',
    hoverinfo='text',
    marker=dict(
        showscale=True,
        colorscale='RdPu_r',
        reversescale=True,
        color=[],
        size=15,
        colorbar=dict(
            thickness=10,
            title='Numnber of Infected Cases',
            xanchor='left',
            titleside='right'
        ),
        line=dict(width=0)))

for node in G.nodes():
    x, y = G.nodes[node]['pos']
    node_trace['x'] += tuple([x])
    node_trace['y'] += tuple([y])

for node, adjacencies in enumerate(G.adjacency()):
    node_trace['marker']['color']+=tuple([len(adjacencies[1])])
    node_info = str(adjacencies[0]) +' # of connections: '+ str(len(adjacencies[1]))
    node_trace['text']+=tuple([node_info])
    
fig = go.Figure(data=[edge_trace, node_trace],
             layout=go.Layout(
                title='<br>Korea Covid Network Connections',
                titlefont=dict(size=16),
                showlegend=False,
                hovermode='closest',
                margin=dict(b=20,l=5,r=5,t=40),
                annotations=[ dict(
                    text="",
                    showarrow=False,
                    xref="paper", yref="paper") ],
                xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                yaxis=dict(showgrid=False, zeroline=False, showticklabels=False)))

iplot(fig)

Policy

In [23]:
policyData = pd.read_csv('covid/policy.csv')
policyDataCopy = policyData.copy()
policyDataCopy['end_date'] = policyDataCopy['end_date'].fillna(datetime.now().strftime('%Y-%m-%d'))
display(policyDataCopy.head())

df = []

for index, row in policyDataCopy.iterrows():
    df.append(dict(Task=row['gov_policy'], Start=row['start_date'], Finish=row['end_date'], Resource=row['type']))

fig = px.timeline(df, x_start="Start", x_end="Finish", y="Task", color="Resource", title="Policy over time")
fig.update_layout(hovermode='x')
fig.show()

patientData = pd.read_csv('covid/patientinfo.csv')
groupPatientData = patientData.groupby('confirmed_date').size().reset_index()
groupPatientData.columns = ['confirmed_date', 'count']

fig = px.line(groupPatientData, x="confirmed_date", y="count", title='Numbers Of Covid Cases Per Day')
fig.update_layout(hovermode='x')
fig.show()
policy_id country type gov_policy detail start_date end_date
0 1 Korea Alert Infectious Disease Alert Level Level 1 (Blue) 2020-01-03 2020-01-19
1 2 Korea Alert Infectious Disease Alert Level Level 2 (Yellow) 2020-01-20 2020-01-27
2 3 Korea Alert Infectious Disease Alert Level Level 3 (Orange) 2020-01-28 2020-02-22
3 4 Korea Alert Infectious Disease Alert Level Level 4 (Red) 2020-02-23 2020-09-21
4 5 Korea Immigration Special Immigration Procedure from China 2020-02-04 2020-09-21

Time

In [161]:
timeData = pd.read_csv('covid/time.csv')
timeDataMelted = pd.melt(timeData, id_vars=['date'], value_vars=['test', 'negative','confirmed','released','deceased'])
display(timeDataMelted.head())

fig = px.line(timeDataMelted, x="date", y="value", color='variable', title="Overall cases over time")
fig.update_layout(hovermode='x')
fig.show()
date variable value
0 2020-01-20 test 1
1 2020-01-21 test 1
2 2020-01-22 test 4
3 2020-01-23 test 22
4 2020-01-24 test 27
In [25]:
timeDataMelted = pd.melt(timeData, id_vars=['date'], value_vars=['confirmed','released','deceased'])
display(timeDataMelted.head())

fig = px.line(timeDataMelted, x="date", y="value", color='variable', title="Confirmed, released & deceased over time")
fig.update_layout(hovermode='x')
fig.show()
date variable value
0 2020-01-20 confirmed 1
1 2020-01-21 confirmed 1
2 2020-01-22 confirmed 1
3 2020-01-23 confirmed 1
4 2020-01-24 confirmed 2
In [26]:
miniTimeData = timeData[['date','test','negative']]
miniTimeData['Test increase'] =  miniTimeData['test'] - miniTimeData['test'].shift(1) 
miniTimeData['Negative increase'] =  miniTimeData['negative'] - miniTimeData['negative'].shift(1) 
miniTimeData['Month'] = pd.to_datetime(miniTimeData['date']).dt.month
miniTimeData['Day'] = pd.to_datetime(miniTimeData['date']).dt.day_name()
display(miniTimeData.head())
date test negative Test increase Negative increase Month Day
0 2020-01-20 1 0 NaN NaN 1 Monday
1 2020-01-21 1 0 0.0 0.0 1 Tuesday
2 2020-01-22 4 3 3.0 3.0 1 Wednesday
3 2020-01-23 22 21 18.0 18.0 1 Thursday
4 2020-01-24 27 25 5.0 4.0 1 Friday

Test & Negative cases daily average increase amount by day

In [27]:
miniTimeDataDay = miniTimeData[['Test increase', 'Negative increase', 'Day']]
miniTimeDataDay = miniTimeDataDay.groupby('Day')['Test increase', 'Negative increase'].mean()
cats = [ 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
miniTimeDataDay = miniTimeDataDay.reindex(cats) 
miniTimeDataDay = miniTimeDataDay.reset_index()
miniTimeDataDayMelted = pd.melt(miniTimeDataDay, id_vars=['Day'], value_vars=['Test increase','Negative increase'])
display(miniTimeDataDayMelted.head())

fig = px.bar(miniTimeDataDayMelted, x='Day', y='value', color='variable', barmode='group')
fig.update_layout(hovermode='x')
fig.show()
Day variable value
0 Monday Test increase 4852.565217
1 Tuesday Test increase 9753.958333
2 Wednesday Test increase 8594.391304
3 Thursday Test increase 8632.000000
4 Friday Test increase 9373.869565

Test & Negative cases daily average increase amount by month

In [28]:
miniTimeDataMonth = miniTimeData[['Test increase', 'Negative increase', 'Month']]
miniTimeDataMonth = miniTimeDataMonth.groupby('Month')['Test increase', 'Negative increase'].mean()
miniTimeDataMonth = miniTimeDataMonth.reset_index()
miniTimeDataMonthMelted = pd.melt(miniTimeDataMonth, id_vars=['Month'], value_vars=['Test increase','Negative increase'])
display(miniTimeDataMonthMelted.head())

fig = px.bar(miniTimeDataMonthMelted, x='Month', y='value', color='variable', barmode='group')
fig.update_layout(hovermode='x')
fig.show()
Month variable value
0 1 Test increase 28.272727
1 2 Test increase 3232.517241
2 3 Test increase 10209.967742
3 4 Test increase 6977.233333
4 5 Test increase 9385.193548
In [29]:
miniTimeData = timeData[['date','confirmed','released','deceased']]
miniTimeData['confirmed increase'] =  miniTimeData['confirmed'] - miniTimeData['confirmed'].shift(1) 
miniTimeData['released increase'] =  miniTimeData['released'] - miniTimeData['released'].shift(1) 
miniTimeData['deceased increase'] =  miniTimeData['deceased'] - miniTimeData['deceased'].shift(1) 
miniTimeData['Month'] = pd.to_datetime(miniTimeData['date']).dt.month
miniTimeData['Day'] = pd.to_datetime(miniTimeData['date']).dt.day_name()
display(miniTimeData.head())
date confirmed released deceased confirmed increase released increase deceased increase Month Day
0 2020-01-20 1 0 0 NaN NaN NaN 1 Monday
1 2020-01-21 1 0 0 0.0 0.0 0.0 1 Tuesday
2 2020-01-22 1 0 0 0.0 0.0 0.0 1 Wednesday
3 2020-01-23 1 0 0 0.0 0.0 0.0 1 Thursday
4 2020-01-24 2 0 0 1.0 0.0 0.0 1 Friday

Confirmed, released & deceased cases daily average increase amount by day

In [30]:
miniTimeDataDay = miniTimeData[['confirmed increase', 'released increase','deceased increase', 'Day']]
miniTimeDataDay = miniTimeDataDay.groupby('Day')['confirmed increase', 'released increase','deceased increase'].mean()
cats = [ 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
miniTimeDataDay = miniTimeDataDay.reindex(cats) 
miniTimeDataDay = miniTimeDataDay.reset_index()
miniTimeDataDayMelted = pd.melt(miniTimeDataDay, id_vars=['Day'], value_vars=['confirmed increase', 'released increase','deceased increase'])
display(miniTimeDataDayMelted.head())

fig = px.bar(miniTimeDataDayMelted, x='Day', y='value', color='variable', barmode='group')
fig.update_layout(hovermode='x')
fig.show()
Day variable value
0 Monday confirmed increase 67.217391
1 Tuesday confirmed increase 63.375000
2 Wednesday confirmed increase 76.130435
3 Thursday confirmed increase 79.565217
4 Friday confirmed increase 84.086957

Confirmed, released & deceased cases daily average increase amount by month

In [31]:
miniTimeDataMonth = miniTimeData[['confirmed increase', 'released increase','deceased increase', 'Month']]
miniTimeDataMonth = miniTimeDataMonth.groupby('Month')['confirmed increase', 'released increase','deceased increase'].mean()
miniTimeDataMonth = miniTimeDataMonth.reset_index()
miniTimeDataMonthMelted = pd.melt(miniTimeDataMonth, id_vars=['Month'], value_vars=['confirmed increase', 'released increase','deceased increase'])
display(miniTimeDataMonthMelted.head())

fig = px.bar(miniTimeDataMonthMelted, x='Month', y='value', color='variable', barmode='group')
fig.update_layout(hovermode='x')
fig.show()
Month variable value
0 1 confirmed increase 0.909091
1 2 confirmed increase 108.241379
2 3 confirmed increase 214.064516
3 4 confirmed increase 32.633333
4 5 confirmed increase 22.677419

TimeAge

In [32]:
timeAgeData = pd.read_csv('covid/timeage.csv')
display(timeAgeData.head())

fig = px.line(timeAgeData, x="date", y="confirmed", color='age', title="Confirmed cases of various age group over time")
fig.update_layout(hovermode='x')
fig.show()
date time age confirmed deceased
0 2020-03-02 0 0s 32 0
1 2020-03-02 0 10s 169 0
2 2020-03-02 0 20s 1235 0
3 2020-03-02 0 30s 506 1
4 2020-03-02 0 40s 633 1
In [33]:
timeAgeData = pd.read_csv('covid/timeage.csv')
fig = px.line(timeAgeData, x="date", y="deceased", color='age', title="Deceased cases of various age group over time")
fig.update_layout(hovermode='x')
fig.show()

Deceased vs confirmed cases across different age group over time

In [145]:
timeAgeData = pd.read_csv('covid/timeage.csv')
timeAgeData['month'] = pd.to_datetime(timeAgeData['date']).dt.month
display(timeAgeData.head())

fig = px.scatter(timeAgeData, 
                 x='confirmed', 
                 y='deceased', 
                 color="age",
                 size_max=100,
                 size="deceased",
                 animation_frame="date", 
                 animation_group="age",
                 range_x=[0, timeAgeData['confirmed'].max()], 
                 range_y=[0, timeAgeData['deceased'].max() + timeAgeData['deceased'].std()]
)

fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 200
fig.layout.updatemenus[0].buttons[0].args[1]["transition"]["duration"] = 200
fig.layout.coloraxis.showscale = False
fig.layout.sliders[0].pad.t = 10
fig.layout.updatemenus[0].pad.t= 10
fig.show()
date time age confirmed deceased month
0 2020-03-02 0 0s 32 0 3
1 2020-03-02 0 10s 169 0 3
2 2020-03-02 0 20s 1235 0 3
3 2020-03-02 0 30s 506 1 3
4 2020-03-02 0 40s 633 1 3

Time Gender

In [35]:
timeGender = pd.read_csv('covid/timegender.csv')
display(timeGender.head())

fig = px.line(timeGender, x="date", y="confirmed", color='sex', title="Confirmed cases between genders over time")
fig.update_layout(hovermode='x')
fig.show()
date time sex confirmed deceased
0 2020-03-02 0 male 1591 13
1 2020-03-02 0 female 2621 9
2 2020-03-03 0 male 1810 16
3 2020-03-03 0 female 3002 12
4 2020-03-04 0 male 1996 20
In [36]:
fig = px.line(timeGender, x="date", y="deceased", color='sex', title="Deceased cases between genders over time")
fig.update_layout(hovermode='x')
fig.show()

Deceased vs confirmed cases across different gender over time

In [170]:
timeGender = pd.read_csv('covid/timegender.csv')
display(timeGender.head())

fig = px.scatter(timeGender, 
                 x='confirmed', 
                 y='deceased', 
                 color="sex",
                 size_max=100,
                 size="deceased",
                 animation_frame="date", 
                 animation_group="sex",
                 range_x=[0, timeGender['confirmed'].max() + timeGender['confirmed'].std()], 
                 range_y=[0, timeGender['deceased'].max() + timeGender['deceased'].std()]
)

fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 200
fig.layout.updatemenus[0].buttons[0].args[1]["transition"]["duration"] = 200
fig.layout.coloraxis.showscale = False
fig.layout.sliders[0].pad.t = 10
fig.layout.updatemenus[0].pad.t= 10
fig.show()
date time sex confirmed deceased
0 2020-03-02 0 male 1591 13
1 2020-03-02 0 female 2621 9
2 2020-03-03 0 male 1810 16
3 2020-03-03 0 female 3002 12
4 2020-03-04 0 male 1996 20

Confirmed cases daily average increase amount by month (Sort by gender)

In [37]:
timeGender = pd.read_csv('covid/timegender.csv')
timeGenderAnalysis = timeGender.copy()

timeGenderAnalysis['confirmed increase'] = timeGenderAnalysis['confirmed'] - timeGenderAnalysis['confirmed'].shift(2)
timeGenderAnalysis['deceased increase'] = timeGenderAnalysis['deceased'] - timeGenderAnalysis['deceased'].shift(2)

timeGenderAnalysis['month'] = pd.to_datetime(timeGenderAnalysis['date']).dt.month
timeGenderAnalysis['day'] = pd.to_datetime(timeGenderAnalysis['date']).dt.day_name()

timeGenderAnalysisMonth = timeGenderAnalysis.groupby(['month','sex'])['confirmed increase'].mean()
timeGenderAnalysisMonth = timeGenderAnalysisMonth.reset_index()

timeGenderAnalysisMonthMelted = pd.melt(timeGenderAnalysisMonth, id_vars=['month','sex'], value_vars=['confirmed increase'])
display(timeGenderAnalysisMonthMelted.head())

fig = px.bar(timeGenderAnalysisMonthMelted, x='month', y='value', color='sex', barmode='group')
fig.update_layout(hovermode='x')
fig.show()
month sex variable value
0 3 female confirmed increase 112.413793
1 3 male confirmed increase 79.793103
2 4 female confirmed increase 17.733333
3 4 male confirmed increase 14.900000
4 5 female confirmed increase 8.387097

Confirmed cases daily average increase amount by day (Sort by gender)

In [38]:
timeGenderAnalysisMonth = timeGenderAnalysis.groupby(['day','sex'])['confirmed increase'].mean()
timeGenderAnalysisMonth = timeGenderAnalysisMonth.reset_index()

weekdays = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
timeGenderAnalysisMonth['day'] = pd.Categorical(timeGenderAnalysisMonth['day'],categories=weekdays)
timeGenderAnalysisMonth = timeGenderAnalysisMonth.sort_values('day')

timeGenderAnalysisMonthMelted = pd.melt(timeGenderAnalysisMonth, id_vars=['day','sex'], value_vars=['confirmed increase'])
display(timeGenderAnalysisMonthMelted.head())

fig = px.bar(timeGenderAnalysisMonthMelted, x='day', y='value', color='sex', barmode='group')
fig.update_layout(hovermode='x')
fig.show()
day sex variable value
0 Monday female confirmed increase 25.352941
1 Monday male confirmed increase 21.764706
2 Tuesday female confirmed increase 41.500000
3 Tuesday male confirmed increase 34.833333
4 Wednesday female confirmed increase 48.941176

Deceased cases daily average increase amount by month (Sort by gender)

In [39]:
timeGenderAnalysisMonth = timeGenderAnalysis.groupby(['month','sex'])['deceased increase'].mean()
timeGenderAnalysisMonth = timeGenderAnalysisMonth.reset_index()

timeGenderAnalysisMonthMelted = pd.melt(timeGenderAnalysisMonth, id_vars=['month','sex'], value_vars=['deceased increase'])
display(timeGenderAnalysisMonthMelted.head())

fig = px.bar(timeGenderAnalysisMonthMelted, x='month', y='value', color='sex', barmode='group')
fig.update_layout(hovermode='x')
fig.show()
month sex variable value
0 3 female deceased increase 2.448276
1 3 male deceased increase 2.379310
2 4 female deceased increase 1.233333
3 4 male deceased increase 1.600000
4 5 female deceased increase 0.322581

Deceased cases daily average increase amount by day (Sort by gender)

In [40]:
timeGenderAnalysisMonth = timeGenderAnalysis.groupby(['day','sex'])['deceased increase'].mean()
timeGenderAnalysisMonth = timeGenderAnalysisMonth.reset_index()

weekdays = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
timeGenderAnalysisMonth['day'] = pd.Categorical(timeGenderAnalysisMonth['day'],categories=weekdays)
timeGenderAnalysisMonth = timeGenderAnalysisMonth.sort_values('day')

timeGenderAnalysisMonthMelted = pd.melt(timeGenderAnalysisMonth, id_vars=['day','sex'], value_vars=['deceased increase'])
display(timeGenderAnalysisMonthMelted.head())

fig = px.bar(timeGenderAnalysisMonthMelted, x='day', y='value', color='sex', barmode='group')
fig.update_layout(hovermode='x')
fig.show()
day sex variable value
0 Monday female deceased increase 0.882353
1 Monday male deceased increase 0.823529
2 Tuesday female deceased increase 1.666667
3 Tuesday male deceased increase 1.166667
4 Wednesday female deceased increase 0.882353

Time Province

In [126]:
timeProvince = pd.read_csv('covid/timeprovince.csv')
display(timeProvince.head())

fig = px.line(timeProvince, x="date", y="confirmed", color='province', title="Confirmed cases of various province over time")
fig.update_layout(hovermode='x')


fig.show()
date time province confirmed released deceased
0 2020-01-20 16 Seoul 0 0 0
1 2020-01-20 16 Busan 0 0 0
2 2020-01-20 16 Daegu 0 0 0
3 2020-01-20 16 Incheon 1 0 0
4 2020-01-20 16 Gwangju 0 0 0
In [42]:
fig = px.line(timeProvince, x="date", y="released", color='province', title="Released cases of various province over time")
fig.update_layout(hovermode='x')
fig.show()
In [43]:
fig = px.line(timeProvince, x="date", y="deceased", color='province', title="Deceased cases of various province over time")
fig.update_layout(hovermode='x')
fig.show()

Deceased vs confirmed cases across different province over time (With Daegu)

In [172]:
timeProvince = pd.read_csv('covid/timeprovince.csv')
display(timeProvince.head())

fig = px.scatter(timeProvince, 
                 y='released', 
                 x='confirmed', 
                 color="province",
                 size_max=100,
                 size="deceased",
                 animation_frame="date", 
                 animation_group="province",
                 range_y=[0, timeProvince['confirmed'].max() + timeProvince['confirmed'].std()], 
                 range_x=[0, timeProvince['released'].max() + timeProvince['released'].std()]
)

fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 200
fig.layout.updatemenus[0].buttons[0].args[1]["transition"]["duration"] = 200
fig.layout.coloraxis.showscale = False
fig.layout.sliders[0].pad.t = 10
fig.layout.updatemenus[0].pad.t= 10
fig.show()
date time province confirmed released deceased
0 2020-01-20 16 Seoul 0 0 0
1 2020-01-20 16 Busan 0 0 0
2 2020-01-20 16 Daegu 0 0 0
3 2020-01-20 16 Incheon 1 0 0
4 2020-01-20 16 Gwangju 0 0 0

Deceased vs confirmed cases across different province over time (Without Daegu)

In [175]:
timeProvince = pd.read_csv('covid/timeprovince.csv')
timeProvince = timeProvince[timeProvince['province'] != 'Daegu']
display(timeProvince.head())

fig = px.scatter(timeProvince, 
                 y='released', 
                 x='confirmed', 
                 color="province",
                 size_max=100,
                 size="deceased",
                 animation_frame="date", 
                 animation_group="province",
                 range_y=[0, timeProvince['confirmed'].max() + timeProvince['confirmed'].std()], 
                 range_x=[0, timeProvince['released'].max() + timeProvince['released'].std()]
)

fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 200
fig.layout.updatemenus[0].buttons[0].args[1]["transition"]["duration"] = 200
fig.layout.coloraxis.showscale = False
fig.layout.sliders[0].pad.t = 10
fig.layout.updatemenus[0].pad.t= 10
fig.show()
date time province confirmed released deceased
0 2020-01-20 16 Seoul 0 0 0
1 2020-01-20 16 Busan 0 0 0
3 2020-01-20 16 Incheon 1 0 0
4 2020-01-20 16 Gwangju 0 0 0
5 2020-01-20 16 Daejeon 0 0 0

In depth Analysis Of Daegu Cases using Technical Analysis.

In [44]:
dataForDaegu = timeProvince[timeProvince['province'] == 'Daegu']
dataForDaegu = dataForDaegu[['date','province','confirmed']]
dataForDaegu['increasePerDay'] = dataForDaegu['confirmed'] - dataForDaegu['confirmed'].shift(1)
dataForDaegu = dataForDaegu[['date','province','increasePerDay']]
dataForDaegu = dataForDaegu.rename({'increasePerDay':'Increase Per Day'}, axis=1)
display(dataForDaegu.head())

fig = px.line(dataForDaegu, x="date", y="Increase Per Day", title="Daegu Cases Per Day")
fig.update_layout(hovermode='x')
fig.show()
date province Increase Per Day
2 2020-01-20 Daegu NaN
19 2020-01-21 Daegu 0.0
36 2020-01-22 Daegu 0.0
53 2020-01-23 Daegu 0.0
70 2020-01-24 Daegu 0.0

Daeugu's Cases is most active during Feb to Apr period.

Technical Analysis

In [45]:
# Uncommend to focus data on active period
# dataForDaegu = dataForDaegu.set_index('date')
# dataForDaegu = dataForDaegu['2020-02-01' : '2020-04-31'].reset_index()
dataForDaegu['10 Days Moving Average'] = dataForDaegu['Increase Per Day'].rolling(10).mean()
dataForDaegu['Upper Limit 10 day Bollinger Band']  = dataForDaegu['Increase Per Day'].rolling(10).mean() + (dataForDaegu['Increase Per Day'].rolling(10).std() * 2)
dataForDaegu['Lower Limit 10 day Bollinger Band']  = dataForDaegu['Increase Per Day'].rolling(10).mean() - (dataForDaegu['Increase Per Day'].rolling(10).std() * 2)
dataForDaeguNew = pd.melt(dataForDaegu, id_vars=['date'], value_vars=['Increase Per Day', '10 Days Moving Average','Upper Limit 10 day Bollinger Band','Lower Limit 10 day Bollinger Band'])
display(dataForDaeguNew.head())

fig = px.line(dataForDaeguNew, x="date", y="value", color='variable', title="Daegu Cases Per Day")
fig['data'][2]['line']['color']="rgba(147,212,219,0.8)"
fig['data'][3]['line']['color']="rgba(157,212,219,0.8)"
fig.update_layout(hovermode='x')
fig.show()
date variable value
0 2020-01-20 Increase Per Day NaN
1 2020-01-21 Increase Per Day 0.0
2 2020-01-22 Increase Per Day 0.0
3 2020-01-23 Increase Per Day 0.0
4 2020-01-24 Increase Per Day 0.0

Stats By Month

In [46]:
dataForDaeguNew['date'] = pd.to_datetime(dataForDaeguNew['date'])
dataForDaeguNew['Month'] = dataForDaeguNew['date'].dt.month 
dataForDaeguNew['Day'] = dataForDaeguNew['date'].dt.day_name()
dataForDaeguMonth = dataForDaeguNew[['variable','Month']]
dataForDaeguMonth = dataForDaeguNew.groupby(['Month','variable'])['value'].mean()
dataForDaeguMonth = dataForDaeguMonth.reset_index()
display(dataForDaeguMonth.head())

fig = px.bar(dataForDaeguMonth, x='Month', y='value', color='variable', barmode='group')
fig.update_layout(hovermode='x')
fig.show()
Month variable value
0 1 10 Days Moving Average 0.000000
1 1 Increase Per Day 0.000000
2 1 Lower Limit 10 day Bollinger Band 0.000000
3 1 Upper Limit 10 day Bollinger Band 0.000000
4 2 10 Days Moving Average 25.265517

Stats By Day

In [47]:
dataForDaeguMonth = dataForDaeguNew[['variable','Day']]
dataForDaeguMonth = dataForDaeguNew.groupby(['Day','variable'])['value'].mean()
dataForDaeguMonth = dataForDaeguMonth.reset_index()


weekdays = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
dataForDaeguMonth['Day'] = pd.Categorical(dataForDaeguMonth['Day'],categories=weekdays)
dataForDaeguMonth = dataForDaeguMonth.sort_values('Day')

display(dataForDaeguMonth.head())

fig = px.bar(dataForDaeguMonth, x='Day', y='value', color='variable', barmode='group')
fig.update_layout(hovermode='x')
fig.show()
Day variable value
4 Monday 10 Days Moving Average 46.068182
5 Monday Increase Per Day 36.434783
6 Monday Lower Limit 10 day Bollinger Band -8.111102
7 Monday Upper Limit 10 day Bollinger Band 100.247465
23 Tuesday Upper Limit 10 day Bollinger Band 92.607688

Time Series World Map Visualisation of COVID-19 cases in korea

In [48]:
timeProvince = pd.read_csv('covid/timeprovince.csv')

data = {'province': ['Seoul', 'Busan', 'Daegu', 'Incheon', 'Gwangju', 'Daejeon',
       'Ulsan', 'Sejong', 'Gyeonggi-do', 'Gangwon-do',
       'Chungcheongbuk-do', 'Chungcheongnam-do', 'Jeollabuk-do',
       'Jeollanam-do', 'Gyeongsangbuk-do', 'Gyeongsangnam-do', 'Jeju-do'], 
        'longitude': [127.047325,129.066666,128.600006,126.705208,126.916664,127.385002,
                     129.316666,127.2822,127.143738,127.920158,
                     127.935905,126.979874,126.916664,126.9910,
                      129.263885,128.429581,126.5312],
        'latitude': [37.517235,35.166668,35.866669,37.456257,35.166668,36.351002,
                    35.549999,36.4870,37.603405,37.342220,
                    36.981304,36.806702,35.166668,34.8679,
                     35.835354,34.855228,33.4996]}
location = pd.DataFrame(data=data)

mergedData = timeProvince.merge(location, on='province', how='left')
mergedData = mergedData.dropna()
mergedData.head()
Out[48]:
date time province confirmed released deceased longitude latitude
0 2020-01-20 16 Seoul 0 0 0 127.047325 37.517235
1 2020-01-20 16 Busan 0 0 0 129.066666 35.166668
2 2020-01-20 16 Daegu 0 0 0 128.600006 35.866669
3 2020-01-20 16 Incheon 1 0 0 126.705208 37.456257
4 2020-01-20 16 Gwangju 0 0 0 126.916664 35.166668

With Animation

In [49]:
fig = px.scatter_mapbox(
    mergedData, lat="latitude", lon="longitude",
    size="confirmed", size_max=100,
    color="deceased", color_continuous_scale=px.colors.sequential.Burg,
    hover_name="province",           
    mapbox_style='dark', zoom=6,
    animation_frame="date", animation_group="province"
)

fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 200
fig.layout.updatemenus[0].buttons[0].args[1]["transition"]["duration"] = 200
fig.layout.coloraxis.showscale = False
fig.layout.sliders[0].pad.t = 10
fig.layout.updatemenus[0].pad.t= 10

fig.show()

Without animation

In [50]:
fig = px.scatter_mapbox(
    mergedData, lat="latitude", lon="longitude",
    size="confirmed", size_max=100,
    color="deceased", color_continuous_scale=px.colors.sequential.Burg,
    hover_name="province",           
    mapbox_style='dark', zoom=6,
)

fig.show()

Weather

In [51]:
weatherData = pd.read_csv('covid/weather.csv')
weatherData = weatherData.set_index('date')
weatherData = weatherData['2020-01-01' : '2020-08-31'].reset_index()
display(weatherData.head())

weatherDataForHeatmap = weatherData.pivot("date", "province", "avg_temp")
f, ax = plt.subplots(figsize=(20, 20))
sns.heatmap(weatherDataForHeatmap, cmap="RdPu", annot=False, fmt="d", linewidths=.5, ax=ax).set_title('Weather Over Time')
date code province avg_temp min_temp max_temp precipitation max_wind_speed most_wind_direction avg_relative_humidity
0 2020-01-01 10000 Seoul -2.2 -6.5 0.3 0.0 2.6 50.0 64.4
1 2020-01-01 11000 Busan 1.9 -3.2 7.8 0.0 5.1 340.0 44.0
2 2020-01-01 12000 Daegu 0.2 -4.9 4.6 0.0 5.6 270.0 53.3
3 2020-01-01 13000 Gwangju -0.3 -4.9 5.7 0.0 4.3 50.0 58.0
4 2020-01-01 14000 Incheon -1.4 -5.4 1.9 0.0 3.8 160.0 66.6
Out[51]:
Text(0.5, 1, 'Weather Over Time')

Busan Avg Temp (Mean) vs Most Wind Direction (Mean) Analysis

In [52]:
high_direction_province = weatherData.copy()
high_direction_province['month'] = high_direction_province['date'].apply(pd.to_datetime).dt.month
high_direction_province.head()
Out[52]:
date code province avg_temp min_temp max_temp precipitation max_wind_speed most_wind_direction avg_relative_humidity month
0 2020-01-01 10000 Seoul -2.2 -6.5 0.3 0.0 2.6 50.0 64.4 1
1 2020-01-01 11000 Busan 1.9 -3.2 7.8 0.0 5.1 340.0 44.0 1
2 2020-01-01 12000 Daegu 0.2 -4.9 4.6 0.0 5.6 270.0 53.3 1
3 2020-01-01 13000 Gwangju -0.3 -4.9 5.7 0.0 4.3 50.0 58.0 1
4 2020-01-01 14000 Incheon -1.4 -5.4 1.9 0.0 3.8 160.0 66.6 1
In [53]:
high_direction_province_grouped = high_direction_province.groupby(['province','month']).agg({'avg_temp':['mean','std','min','max'],
                                                                                             'min_temp':['mean','std'],
                                                                                             'max_temp':['mean','std'],
                                                                                             'max_wind_speed':['mean','std'],
                                                                                             'most_wind_direction':['mean','std'],
                                                                                             'avg_relative_humidity':['mean','std']})
print('Agg Data')
display(high_direction_province_grouped.head())

high_direction_province_grouped = high_direction_province_grouped.stack().stack()
high_direction_province_grouped = high_direction_province_grouped.reset_index()
high_direction_province_grouped.columns = ['Province','Month', 'Measurement Type','Weather Type','Value']
high_direction_province_grouped = high_direction_province_grouped.set_index(['Weather Type','Measurement Type','Province','Month'])

print('Indexed Data')
display(high_direction_province_grouped.head())
Agg Data
avg_temp min_temp max_temp max_wind_speed most_wind_direction avg_relative_humidity
mean std min max mean std mean std mean std mean std mean std
province month
Busan 1 6.412903 2.652514 1.9 14.4 3.135484 2.938316 10.883871 2.618154 6.245161 2.547788 212.580645 132.362324 56.351613 18.844537
2 7.137931 3.514197 0.1 12.8 3.113793 4.129141 11.837931 3.533858 6.344828 2.403507 223.448276 121.428282 55.134483 18.707929
3 10.441935 2.377782 5.1 14.5 6.438710 2.874332 14.961290 2.177411 7.125806 2.507850 163.333333 100.458718 56.906452 16.644397
4 12.620000 2.059193 9.0 16.9 8.896667 2.074390 17.543333 3.003295 7.580000 2.217237 180.333333 96.614532 52.570000 15.572681
5 17.877419 1.552355 14.9 20.2 14.977419 1.664073 21.790323 2.448313 6.167742 2.057731 135.806452 74.867625 72.483871 14.036930
Indexed Data
Value
Weather Type Measurement Type Province Month
avg_temp max Busan 1 14.400000
avg_relative_humidity mean Busan 1 56.351613
avg_temp mean Busan 1 6.412903
max_temp mean Busan 1 10.883871
max_wind_speed mean Busan 1 6.245161
In [54]:
busan_data = high_direction_province_grouped.loc[(['avg_temp','most_wind_direction'],slice(None),'Busan'), :].sort_values(['Weather Type','Month'])
busan_data.head()
Out[54]:
Value
Weather Type Measurement Type Province Month
avg_temp max Busan 1 14.400000
mean Busan 1 6.412903
min Busan 1 1.900000
std Busan 1 2.652514
max Busan 2 12.800000
In [55]:
busan_data_mean = high_direction_province_grouped.loc[(['avg_temp','most_wind_direction'],'mean','Busan'), :].sort_values(['Weather Type','Month']).reset_index()
display(busan_data_mean.head())

fig = px.bar(busan_data_mean, x='Month', y='Value', color='Weather Type', barmode='group')
fig.update_layout(hovermode='x')
fig.show()
Weather Type Measurement Type Province Month Value
0 avg_temp mean Busan 1 6.412903
1 avg_temp mean Busan 2 7.137931
2 avg_temp mean Busan 3 10.441935
3 avg_temp mean Busan 4 12.620000
4 avg_temp mean Busan 5 17.877419
In [56]:
busan_temp = busan_data_mean[busan_data_mean['Weather Type'] == 'avg_temp']['Value']
busan_temp = pd.DataFrame(busan_temp).reset_index().drop('index',axis=1)
busan_temp.columns = ['avg_temp']
# display(busan_temp)

busan_wind_direction = busan_data_mean[busan_data_mean['Weather Type'] == 'most_wind_direction']['Value']
busan_wind_direction = pd.DataFrame(busan_wind_direction).reset_index().drop('index',axis=1)
busan_wind_direction.columns = ['most_wind_direction']
# display(busan_wind_direction)

plot_table = busan_wind_direction.join(busan_temp)
display(plot_table)

fig = px.scatter(plot_table, x="most_wind_direction", y="avg_temp", trendline="ols", title="Avg Temp & Wind Direction")
fig.update_layout(hovermode='x')
fig.show()
most_wind_direction avg_temp
0 212.580645 6.412903
1 223.448276 7.137931
2 163.333333 10.441935
3 180.333333 12.620000
4 135.806452 17.877419
5 140.689655 22.448276

Does other region has the same relationship between avg_temp and most_wind direction?

In [57]:
all_data = high_direction_province_grouped.loc[(['avg_temp','most_wind_direction'],'mean',slice(None),slice(None)), :].sort_values(['Weather Type','Month'])
all_data = all_data.reset_index()

all_data_temp = all_data[all_data['Weather Type'] == 'avg_temp']['Value']
all_data_temp = all_data_temp.reset_index()
all_data_temp = all_data_temp.drop('index',axis=1)
all_data_temp.columns = ['temp']
# all_data_temp.head()

all_data_wind = all_data[all_data['Weather Type'] == 'most_wind_direction']['Value']
all_data_wind = all_data_wind.reset_index()
all_data_wind = all_data_wind.drop('index',axis=1)
all_data_wind.columns = ['wind']
# all_data_wind.head()

all_data_plot = all_data_temp.join(all_data_wind)
display(all_data_plot.head())

fig = px.scatter(all_data_plot, x="wind", y="temp", trendline="ols", title="Avg Temp & Wind Direction")
fig.update_layout(hovermode='x')
fig.show()
temp wind
0 6.412903 212.580645
1 0.609677 126.451613
2 1.770968 152.258065
3 3.764516 213.870968
4 2.706452 246.129032

Region

In [58]:
regionData = pd.read_csv('covid/region.csv')
pd.set_option('display.max_rows', regionData.shape[0]+1)
display(regionData.head())

fig = px.scatter_mapbox(
    regionData[regionData.city != 'Korea'], 
    text="city",
    lat="latitude", 
    lon="longitude",     
    color="elderly_population_ratio", 
    size="nursing_home_count",
    color_continuous_scale=px.colors.sequential.Burg, 
    size_max=100, 
    zoom=6,
    title="Number of nusing home and elder population ratio across Korea")
fig.show()
code province city latitude longitude elementary_school_count kindergarten_count university_count academy_ratio elderly_population_ratio elderly_alone_ratio nursing_home_count
0 10000 Seoul Seoul 37.566953 126.977977 607 830 48 1.44 15.38 5.8 22739
1 10010 Seoul Gangnam-gu 37.518421 127.047222 33 38 0 4.18 13.17 4.3 3088
2 10020 Seoul Gangdong-gu 37.530492 127.123837 27 32 0 1.54 14.55 5.4 1023
3 10030 Seoul Gangbuk-gu 37.639938 127.025508 14 21 0 0.67 19.49 8.5 628
4 10040 Seoul Gangseo-gu 37.551166 126.849506 36 56 1 1.17 14.39 5.7 1080

Web Scalping: Comparsion across the world

In [59]:
page = requests.get("https://www.worldometers.info/coronavirus")
page.status_code
Out[59]:
200
In [60]:
# page.content
soup = BeautifulSoup(page.content, 'lxml')
In [61]:
# print(soup.prettify())
In [62]:
covidTable = soup.find('table', attrs={'id': 'main_table_countries_today'})
# covidTable

Getting the data from the table

In [63]:
rows = covidTable.find_all("tr", attrs={"style": ""})
In [64]:
covidData = []
for i,data in enumerate(rows):
    if i == 0:
        
        covidData.append(data.text.strip().split("\n")[:13])
        
    else:
        covidData.append(data.text.strip().split("\n")[:12])
# covidData

Covert to data

In [65]:
covidTable = pd.DataFrame(covidData[1:], columns=covidData[0][:12])
covidTable = covidTable[~covidTable['#'].isin(['World', 'Total:'])]
covidTable = covidTable.drop('#', axis =1)
covidTable.head()
Out[65]:
Country,Other TotalCases NewCases TotalDeaths NewDeaths TotalRecovered NewRecovered ActiveCases Serious,Critical Tot Cases/1M pop Deaths/1M pop
1 USA 7,004,768 204,118 4,250,140 2,550,510 14,020 21,135 616
2 India 5,487,580 +1,968 87,909 4,396,399 +3,749 1,003,272 8,944 3,968 64
3 Brazil 4,544,629 136,895 3,851,227 556,507 8,318 21,347 643
4 Russia 1,103,399 19,418 909,357 174,624 2,300 7,560 133
5 Peru 768,895 31,369 615,255 122,271 1,425 23,249 948

Worldwide Total Cases Chart

In [66]:
fig = px.bar(covidTable, y='TotalCases', x='Country,Other')
fig.update_layout(hovermode='x')
fig.show()

We will be using log because the data is highly skewed

In [67]:
covidTable['logTotalCases'] = np.log(covidTable['TotalCases'].str.replace(r'\D', '').astype(int))
covidTable = covidTable.sort_values(by=['logTotalCases'])
fig = px.bar(covidTable, y='logTotalCases', x='Country,Other')
fig.update_layout(hovermode='x')
fig.show()

Worldwide Total Death Chart

In [68]:
pd.set_option('display.max_rows', covidTable.shape[0]+1)
covidTable['TotalDeaths'].str.strip()
covidTable['TotalDeaths'].replace('', np.nan, inplace=True)
covidTable['TotalDeaths'].replace(' ', np.nan, inplace=True)
covidTable['TotalDeaths'] = covidTable['TotalDeaths'].fillna(0)

covidTable['logTotalDeath'] = np.log(covidTable['TotalDeaths'].str.replace(r'\D', '').astype(float))
covidTable = covidTable.sort_values(by=['logTotalDeath'])
display(covidTable.head())

fig = px.bar(covidTable, y='logTotalDeath', x='Country,Other')
fig.update_layout(hovermode='x')
fig.show()
Country,Other TotalCases NewCases TotalDeaths NewDeaths TotalRecovered NewRecovered ActiveCases Serious,Critical Tot Cases/1M pop Deaths/1M pop logTotalCases logTotalDeath
202 Western Sahara 10 1 8 1 17 2 2.302585 0.0
174 Burundi 473 1 462 10 40 0.08 6.159095 0.0
194 Caribbean Netherlands 36 1 17 18 1,370 38 3.583519 0.0
193 British Virgin Islands 69 1 48 20 2 2,279 33 4.234107 0.0
184 Curaçao 268 1 96 171 1 1,632 6 5.590987 0.0

Scatterplot between total cases and total death

In [69]:
fig = px.scatter(covidTable, x="logTotalDeath", y="logTotalCases", trendline="ols", title="Total Cases againist Total Death (Log)")
fig.update_layout(hovermode='x')
fig.show()

Summarise Key findings from Exploratory Data Analysis

Overall Cases in Korea

  • On average, there are 100 COVID-19 cases in infected city.
  • The numbers are highly skewed by Nam-gu, a province of Daegu. Nam-gu has about 4500 cases.
  • Most of the clusters are in the province of Daegu, Gyeonggi-do and Seoul

Cases across various Korea province

  • Majority of the province has more individual cases of COVID-19 than group cases with the exception of Incheon and Ulsan
  • Seoul, Daegu, Gyeonggi-do and Gyeongsangbuk-do has the most number of individual cases among the Korea provinces, number of cases range from 150 to 200 with the exception of Daegu
  • In addition, the same group of provinces has the most number of grouped cases, number of cases range from 50-100 with the exception of Daegu

Source of infection

  • Sources of infection mainly come from Shincheonji Church, Itawon Clubs, Contact with Patients and Overseas inflow
  • Most of the infection sources across various province has less than 500 cases with the exception of Daegu.
  • Daegu has more than 4000 cases from Shincheonji Church and about 2000 cases from contact with patient and other cases each.

Patient Cases

  • There is a quick spike in daily COVID-19 cases from Feb to Mar before slowing dripping down to single digit in May. There are double digit COVID-19 cases in June

Recovery Speed

  • Age has an R-squared value of 0.04 for male and 0.03 for female against total treatment days.
  • At first graze, age seems to have little to no effect on recovery speed.
  • However when we used the mean recovery days for each age group, it was found that younger people recovered fast from COVID-19 as compared to older people.
  • Average days to recover for young people (<=30) is around 25 days and below. Middle-aged people(30-60) took about 25-30 days to recover. Older people (>=70) took about 32-36 days to recover
  • Interestingly, infected people in Daegu only conist of people from 50s to 70s. In addition, it took less than 15 days to recover. People in Jeollabuk-do take longer to recover from COVID-19 as compared to the rest of the provinces. Majority of the people took about 20-25 days to recover.

Fatality Speed

  • The average number of treatment day a patient had before he/she passed away is shorter than the average day it took for a patient recover.
  • In addition, younger patient has a lower number of treatment days as compared to older patients. This might indicate that younger patient seek treatment at the later stage of COVID-19.
  • Coincidentally, the number of treatment days increased as the age increases. The trend is similar to the recovery speed among during age group.
  • Daegu has the fastest fatality speed. An average patient is under treatment for less than 10 days before he/she passed away
  • Gyeonggi-do has an outliner. A 30s year old patient died in last than 10 days after seeking treatment.

Percentage of patients recovered, under treatment and deceased

  • Daegu has the highest fatality amount all the provinces. It has a fatality rate of more than 10%. In addition, Daejeon, Gyeongsangbuk-do and Ulsan are the only provinces which have fatailty.
  • Male has a higher fatality rate and a lowest survival rate as compared to female

COVID-19 Tranmission Amount

  • Majority of the cases only spread to 1 other person.
  • Cases #20000000205 is a super spread who spreaded to more than 50 people.
  • In addition, there are less than 10 cases where patient spread to more than 30 people or more.

Policy

  • There is a quick spike in daily COVID-19 cases from Feb to Mar before slowing dripping down to single digit in May. There are double digit COVID-19 cases in June
  • Technology related policy are introduced in Feb, especially thanks to policy like open data, people like us are able to do analysis on Korea COVID-19 cases.
  • After the spike in COVID-19 cases in March, education, social, health and immigration policy are introduced rapidly. Furthermore, the alert level was upgraded to Level 4 red.
  • After the introduction of the policies, the number of COVID-19 cases dropped drastically during April and May. Hence the policies are effective.
  • However, it seems to be lax in control of COVID-19 cases as the COVID-19 slowly creep up to double digit in May/June. This forced the govtnment to introduced administrative policies such as closure of clubs to control the situation.

Cases Over Time Analysis

  • As for 31st Jun 2020, 1.27 million of people has gone for testing, 1.24 million are tested negative, 12.8 thousand of confirmed cases, 11.5 thousand of released cases and 282 people has passed away.
  • Majority of the people are tested negative for covid-19 and less than 1% are tested positive for the virus.
  • Among those who are tested positive, almost 90% of the people has recovered and about 2% of the people has passed away

Cases Over Age & Time Analysis

  • Age group in the 20s 40s 50s 60s has higher infection cases as compared to other age group.
  • Furthermore, in the deceased section, most of the fatalities are in the age group of 60s 70s 80s,
  • This indicate that the risk of death from COVID-19 increases with age.

Cases Over Gender & Time Analysis

  • There are more female infected cases as compared to male cases.
  • The rate of increases of confirmed cases are about the same for both gender over the last 6 months
  • On the other hand, male are more prone to death to COVID-19 as compared to female

Cases Over Province & Time Analysis

  • The province of Daegu, Gyeongsangbuk-do, Incheon and Seoul has the most amount of covid-19 infection cases among the 17 provinces; with Daegu accounting for more than half of the cases.
  • The fatality count is relatively proportional to the infection count with the except of Seoul
  • Bother Seoul and Gyeonggi-do has about 900 infected cases. However, Gyeonggi-do has about 22 fatalities and Seoul has about 7 fatalities.

Weather

  • The weather is about the same across the provinces.
  • The temperature gradually increases from around 5 degree to 25 degree for the period of Jan 2020 to June 2020
  • The weather is relatively the same across all provinces, it has little to do with the reported cases in the short term

Number of nursing home vs Population Ratio

  • We observed an interesting fact regarding the elderly population ratio and the number of nursing home in the city.
  • The elderly population ratio and the number of nursing home has an inverse relationship
  • Perhaps the absolute number of the elder people is smaller where elderly population ratio are high which explains why there is nursing home in places with higher elderly population ratio.

Preparing the data for modelling

We will create a model which will predict whether a COVID-19 patient will survive given a certain set of conditions

  • Sort city & province into different risk level according to number of confirmed cases (Low, Medium, High)
  • Sort age into different age level according to age (kid, adult, elderly)
  • Gender into categorical group of 0 & 1
  • Number of treatment day
  • Status of released, deceased into different category (We will ignore cases who are under treatment)

Get the necessary columns

In [70]:
patientData = pd.read_csv('covid/patientinfo.csv')
patientModellingData = patientData[['sex','age','province','confirmed_date','released_date','deceased_date','state']]

## We removed isolated patient since it is not confirmed if they survived or not
patientModellingData = patientModellingData[patientModellingData['state'] != 'isolated']
nullList = ['sex','age','confirmed_date']
for item in nullList:
     patientModellingData = patientModellingData[~patientModellingData[item].isnull()]
        
display(patientModellingData.head())
sex age province confirmed_date released_date deceased_date state
0 male 50s Seoul 2020-01-23 2020-02-05 NaN released
1 male 30s Seoul 2020-01-30 2020-03-02 NaN released
2 male 50s Seoul 2020-01-30 2020-02-19 NaN released
3 male 20s Seoul 2020-01-30 2020-02-15 NaN released
4 female 20s Seoul 2020-01-31 2020-02-24 NaN released

Calculate number of days of treatment before dropped date column

In [71]:
cols = ['released_date', 'deceased_date', 'confirmed_date']
patientModellingData[cols] = patientModellingData[cols].apply(pd.to_datetime, errors='coerce', axis=1)

def calculate_number_of_treatment_days(row):
    if row["released_date"] is not pd.NaT:
        treatmentDays = pd.to_numeric((row['released_date'] - row['confirmed_date']).days)
        return(treatmentDays)
    elif row["deceased_date"] is not pd.NaT:
        treatmentDays = pd.to_numeric((row['deceased_date'] - row['confirmed_date']).days)
        return(treatmentDays)
    else:
        return None

patientModellingData['Treatment Days'] = patientModellingData.apply(calculate_number_of_treatment_days, axis=1)

patientModellingData = patientModellingData[~patientModellingData['Treatment Days'].isnull()]
patientModellingDataTreatment = patientModellingData[['sex','age','province','state','Treatment Days']]
display(patientModellingDataTreatment.head())
sex age province state Treatment Days
0 male 50s Seoul released 13.0
1 male 30s Seoul released 32.0
2 male 50s Seoul released 20.0
3 male 20s Seoul released 16.0
4 female 20s Seoul released 24.0

Convert state, gender and gender columns to categorical data

In [72]:
genders = {"male": 0, "female": 1}
patientModellingDataTreatment['sex'] = patientModellingDataTreatment['sex'].map(genders)
display(patientModellingDataTreatment.head())
sex age province state Treatment Days
0 0 50s Seoul released 13.0
1 0 30s Seoul released 32.0
2 0 50s Seoul released 20.0
3 0 20s Seoul released 16.0
4 1 20s Seoul released 24.0
In [73]:
state = {"released": 0, "deceased": 1}
patientModellingDataTreatment['state'] = patientModellingDataTreatment['state'].map(state)
display(patientModellingDataTreatment.head())
sex age province state Treatment Days
0 0 50s Seoul 0 13.0
1 0 30s Seoul 0 32.0
2 0 50s Seoul 0 20.0
3 0 20s Seoul 0 16.0
4 1 20s Seoul 0 24.0
In [74]:
age = {'0s':0, '10s':1, '20s':2, '30s':3, '40s':4, '50s':5, '60s':6, '70s':7, '80s':8, '90s':9}
patientModellingDataTreatment['age'] = patientModellingDataTreatment['age'].map(age)
display(patientModellingDataTreatment.head())
sex age province state Treatment Days
0 0 5.0 Seoul 0 13.0
1 0 3.0 Seoul 0 32.0
2 0 5.0 Seoul 0 20.0
3 0 2.0 Seoul 0 16.0
4 1 2.0 Seoul 0 24.0

Get dummy for province

In [75]:
provinceDummy = pd.get_dummies(patientModellingDataTreatment['province'])
provinceDummy
Out[75]:
Chungcheongbuk-do Chungcheongnam-do Daegu Daejeon Gangwon-do Gwangju Gyeonggi-do Gyeongsangbuk-do Gyeongsangnam-do Incheon Jeju-do Jeollabuk-do Jeollanam-do Sejong Seoul Ulsan
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5156 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
5157 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
5158 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
5159 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
5160 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0

1635 rows × 16 columns

In [76]:
dataForModelling = patientModellingDataTreatment.join(provinceDummy)
dataForModelling = dataForModelling.drop('province',axis=1)
dataForModelling
Out[76]:
sex age state Treatment Days Chungcheongbuk-do Chungcheongnam-do Daegu Daejeon Gangwon-do Gwangju Gyeonggi-do Gyeongsangbuk-do Gyeongsangnam-do Incheon Jeju-do Jeollabuk-do Jeollanam-do Sejong Seoul Ulsan
0 0 5.0 0 13.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
1 0 3.0 0 32.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
2 0 5.0 0 20.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
3 0 2.0 0 16.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
4 1 2.0 0 24.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5156 0 3.0 0 46.0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
5157 1 2.0 0 32.0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
5158 1 1.0 0 12.0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
5159 1 3.0 0 34.0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
5160 1 3.0 0 14.0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0

1635 rows × 20 columns

Correlation Matrix

In [77]:
plt.figure(figsize=(10,5))
sns.heatmap(dataForModelling.corr(), cmap="RdPu")

dataForModelling['age'] = pd.to_numeric(dataForModelling['age'])

dataForModellingForCleaning = dataForModelling.copy()

display(dataForModellingForCleaning.count())
print(np.any(np.isnan(dataForModellingForCleaning)))
print(np.all(np.isfinite(dataForModellingForCleaning)))

def clean_dataset(df):
    assert isinstance(df, pd.DataFrame), "df needs to be a pd.DataFrame"
    df.dropna(inplace=True)
    indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(1)
    return df[indices_to_keep].astype(np.float64)

clean_dataset(dataForModellingForCleaning)
display(dataForModellingForCleaning.count())
print(np.any(np.isnan(dataForModellingForCleaning)))
print(np.all(np.isfinite(dataForModellingForCleaning)))
sex                  1635
age                  1634
state                1635
Treatment Days       1635
Chungcheongbuk-do    1635
Chungcheongnam-do    1635
Daegu                1635
Daejeon              1635
Gangwon-do           1635
Gwangju              1635
Gyeonggi-do          1635
Gyeongsangbuk-do     1635
Gyeongsangnam-do     1635
Incheon              1635
Jeju-do              1635
Jeollabuk-do         1635
Jeollanam-do         1635
Sejong               1635
Seoul                1635
Ulsan                1635
dtype: int64
True
False
sex                  1634
age                  1634
state                1634
Treatment Days       1634
Chungcheongbuk-do    1634
Chungcheongnam-do    1634
Daegu                1634
Daejeon              1634
Gangwon-do           1634
Gwangju              1634
Gyeonggi-do          1634
Gyeongsangbuk-do     1634
Gyeongsangnam-do     1634
Incheon              1634
Jeju-do              1634
Jeollabuk-do         1634
Jeollanam-do         1634
Sejong               1634
Seoul                1634
Ulsan                1634
dtype: int64
False
True

Building Machine Learning Models Part 1

In [78]:
x = dataForModellingForCleaning.drop("state", axis=1)
y = dataForModellingForCleaning["state"]
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.33, random_state=42)

Stochastic Gradient Descent (SGD):

In [79]:
sgd = linear_model.SGDClassifier(max_iter=5, tol=None)
sgd.fit(X_train, Y_train)
Y_pred = sgd.predict(X_test)

sgd.score(X_train, Y_train)

acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)

Random Forest:

In [80]:
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)

Y_prediction = random_forest.predict(X_test)

random_forest.score(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)

Logistic Regression:

In [81]:
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)

Y_pred = logreg.predict(X_test)

acc_log = round(logreg.score(X_train, Y_train) * 100, 2)

Gaussian Naive Bayes:

In [82]:
gaussian = GaussianNB() 
gaussian.fit(X_train, Y_train)  
Y_pred = gaussian.predict(X_test)  
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)

K Nearest Neighbor:

In [83]:
knn = KNeighborsClassifier(n_neighbors = 3) 
knn.fit(X_train, Y_train)  
Y_pred = knn.predict(X_test)  
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)

Perceptron:

In [84]:
perceptron = Perceptron(max_iter=5)
perceptron.fit(X_train, Y_train)

Y_pred = perceptron.predict(X_test)

acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)

Linear Support Vector Machine:

In [85]:
linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)

Y_pred = linear_svc.predict(X_test)

acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)

Decision Tree

In [86]:
decision_tree = DecisionTreeClassifier() 
decision_tree.fit(X_train, Y_train)  
Y_pred = decision_tree.predict(X_test)  
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)

Getting the best model

In [87]:
results = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression', 
              'Random Forest', 'Naive Bayes', 'Perceptron', 
              'Stochastic Gradient Decent', 
              'Decision Tree'],
    'Score': [acc_linear_svc, acc_knn, acc_log, 
              acc_random_forest, acc_gaussian, acc_perceptron, 
              acc_sgd, acc_decision_tree]})
result_df = results.sort_values(by='Score', ascending=False)
result_df = result_df.set_index('Score')
result_df.head(9)
Out[87]:
Model
Score
99.63 Random Forest
99.63 Decision Tree
98.54 KNN
98.45 Support Vector Machines
98.17 Logistic Regression
98.17 Stochastic Gradient Decent
97.90 Perceptron
38.48 Naive Bayes

Decision Tree Diagram

In [88]:
feature_cols = x.columns

dot_data = StringIO()
export_graphviz(decision_tree, out_file = dot_data, 
                      feature_names = feature_cols,  
                     filled = True, rounded = True,  
                    special_characters = True)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('covidTree.png')
Image(graph.create_png())

# Entropy 0 == no disorder // perfect knowledge, perfect classification
Out[88]:

What is the most important feature ?

In [89]:
importances = pd.DataFrame({'feature':X_train.columns,'importance':np.round(random_forest.feature_importances_,3)})
importances1 = importances.sort_values('importance',ascending=False).set_index('feature')
importances1.head(15)
Out[89]:
importance
feature
Treatment Days 0.594
age 0.207
Daegu 0.116
sex 0.025
Gyeongsangbuk-do 0.016
Seoul 0.009
Gangwon-do 0.009
Chungcheongnam-do 0.006
Ulsan 0.005
Incheon 0.004
Gyeongsangnam-do 0.003
Chungcheongbuk-do 0.002
Gyeonggi-do 0.002
Daejeon 0.001
Gwangju 0.000
In [90]:
barData = importances.reset_index()
fig = px.bar(barData, x='feature', y='importance')
fig.update_layout(hovermode='x')
fig.show()

Confusion Matrix with Precision & Recall & F-Score

In [91]:
predictions = cross_val_predict(random_forest, X_train, Y_train, cv=3)
result1_train = confusion_matrix(Y_train, predictions)
display(result1_train)

p1_train = precision_score(Y_train, predictions)
r1_train = recall_score(Y_train, predictions)
f1_train = f1_score(Y_train, predictions)
print("Precision:", p1_train)
print("Recall:", r1_train)
print("F-Score:", f1_train)
array([[1042,    7],
       [  15,   30]])
Precision: 0.8108108108108109
Recall: 0.6666666666666666
F-Score: 0.7317073170731707
In [92]:
predictions = cross_val_predict(random_forest, X_test, Y_test, cv=3)
result1_test = confusion_matrix(Y_test, predictions)
display(result1_test)

p1_test = precision_score(Y_test, predictions)
r1_test = recall_score(Y_test, predictions)
f1_test = f1_score(Y_test, predictions)
print("Precision:", p1_test)
print("Recall:", r1_test)
print("F-Score:", f1_test)
array([[514,   4],
       [  9,  13]])
Precision: 0.7647058823529411
Recall: 0.5909090909090909
F-Score: 0.6666666666666667

Building Machine Learning Models Part 2

Lets observe what happened if we drop the treatment days column Will the model be more accurate?

In [93]:
x2 = dataForModellingForCleaning.drop(["state","Treatment Days"], axis=1)
y2 = dataForModellingForCleaning["state"]
X_train, X_test, Y_train, Y_test = train_test_split(x2, y2, test_size=0.33, random_state=42)

Random Forest

In [94]:
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)

Y_prediction = random_forest.predict(X_test)

random_forest.score(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)

Decision Tree

In [95]:
decision_tree = DecisionTreeClassifier() 
decision_tree.fit(X_train, Y_train)  
Y_pred = decision_tree.predict(X_test)  
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)

Getting the best model

In [96]:
results = pd.DataFrame({
    'Model': ['Random Forest','Decision Tree'],
    'Score': [acc_random_forest, acc_decision_tree]})
result_df = results.sort_values(by='Score', ascending=False)
result_df = result_df.set_index('Score')
result_df.head(2)
Out[96]:
Model
Score
97.07 Random Forest
97.07 Decision Tree

Decision Tree Diagram

In [97]:
feature_cols = x2.columns

dot_data = StringIO()
export_graphviz(decision_tree, out_file = dot_data, 
                      feature_names = feature_cols,  
                     filled = True, rounded = True,  
                    special_characters = True)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('covidTree2.png')
Image(graph.create_png())
Out[97]:

Importance

In [98]:
importances = pd.DataFrame({'feature':X_train.columns,'importance':np.round(random_forest.feature_importances_,3)})
importances2 = importances.sort_values('importance',ascending=False).set_index('feature')
importances2.head(15)
Out[98]:
importance
feature
age 0.430
Daegu 0.334
sex 0.082
Gangwon-do 0.043
Gyeongsangbuk-do 0.043
Ulsan 0.012
Chungcheongbuk-do 0.011
Chungcheongnam-do 0.008
Seoul 0.008
Gyeonggi-do 0.007
Incheon 0.007
Gyeongsangnam-do 0.005
Jeollanam-do 0.004
Daejeon 0.002
Gwangju 0.002
In [99]:
barData = importances.reset_index()
fig = px.bar(barData, x='feature', y='importance')
fig.update_layout(hovermode='x')
fig.show()

Confusion Matrix with Precision & Recall & F-Score

In [100]:
predictions = cross_val_predict(random_forest, X_train, Y_train, cv=3)
result2_train = confusion_matrix(Y_train, predictions)
display(result2_train)

p2_train = precision_score(Y_train, predictions)
r2_train = recall_score(Y_train, predictions)
f2_train = f1_score(Y_train, predictions)
print("Precision:", p2_train)
print("Recall:", r2_train)
print("F-Score:", f2_train)
array([[1047,    2],
       [  34,   11]])
Precision: 0.8461538461538461
Recall: 0.24444444444444444
F-Score: 0.37931034482758624
In [101]:
predictions = cross_val_predict(random_forest, X_test, Y_test, cv=3)
result2_test = confusion_matrix(Y_test, predictions)
display(result2_test)

p2_test = precision_score(Y_test, predictions)
r2_test = recall_score(Y_test, predictions)
f2_test = f1_score(Y_test, predictions)
print("Precision:", p2_test)
print("Recall:", r2_test)
print("F-Score:", f2_test)
array([[510,   8],
       [ 13,   9]])
Precision: 0.5294117647058824
Recall: 0.4090909090909091
F-Score: 0.46153846153846156

Summary between the old and new model

The old model is better than the new model even though it has a lower precision rate.
The number of survivor in COVID-19 is much higher than the number of deceased.
Hence, it is much harder to predict who will pass away from the infection.

The new model is unable to predict whether a patient will pass away.
The number of false negatives outweigh the number of true negatives.

On the other hand, the old model is able to predict the deceased much accurately than the new model.
With that in mind, this also indicated that the number of days of treatment is vital.

Step to take to control the COVID-19 situation even better

Generally, those who passed away from the infection had lesser days of treatment. However, is it right for us to conclude that lesser amount of treatment days, the higher the fatality rate? Are the risk of dying higher during first few days/weeks of treatment? Or is there something more to it?

Abraham Wald and the Missing Bullet Holes

This situation is something similar to Abraham Wald and the Missing Bullet Holes. @penguinpress summed up the situation perfectly:

You don't want your planes to get shot down by enemy fighters, so you armor them. But armor makes the plane heavier, and heavier planes are less maneuverable and use more fuel. Armoring the planes too much is a problem; armoring the planes too little is a problem. Somewhere in between there's an optimum.

When American planes came back from engagements over Europe, they were covered in bullet holes. But the damage wasn't uniformly distributed across the aircraft. There were more bullet holes in the fuselage, not so many in the engines.

title

At first glance, it seems reasonable to focus the armor on the fuselage. However, the armor, said Wald, doesn't go where the bullet holes are. It goes where the bullet holes aren't: on the engines.

The missing bullet holes were on the missing planes. The reason planes were coming back with fewer hits to the engine is that planes that got hit in the engine weren't coming back.

If you go to the recovery room at the hospital, you'll see a lot more people with bullet holes in their legs than people with bullet holes in their chests. But that's not because people don't get shot in the chest; it's because the people who get shot in the chest don't recover.

The link to the excellent article written by @penguinpress is in the credit section below.

Survivorship Bias

In this case, our model only consists of people who have passed away after they have gone through treatment. We have excluded people who have died even before they have the opportunity to go through treatment! 

Even after we have excluded those who have not undergo treatment before they passed away, our model indicated that the number of treatment days is the strongest factor among gender, location and age. Treatment days has a whooping percentage of 60% in terms of importance next to age which has a importance of 20%.

Improving the situation

The chances of survival increases as the number of treatment days increases. Hence, the faster detection rate, the earlier the treatment, the more days and opportunities to treat the infection, the better the chances of survival. The Korea government put a very strong emphasis on COVID-19 detection programme and hence they managed to contain the COVID-19 situation.

Various cities among the world were also in lockdown to prevent the rapid spread of COVID-19. The healthcare system will be overwhelmed if COVID-19 were to spread widely which will cause the death rate to escalate as more people will have lesser opportunity to undergo treatment.

The older the person, the higher the chances of infection and death. Hence it is not advisable for elderly people to roam around unless it is necessary.

In terms of locations, the majority of the cases came from Daegu and the sources of infection is from places such as church and clubs. The higher the amount of contact, the greater the chances of infection. It is advisable for the authority to close such areas until a vaccine for COVID-19 is found. During May to June, there is an increase in COVID-19 cases due to complacency.

https://www.aa.com.tr/en/asia-pacific/s-korea-sees-mass-covid-19-cases-linked-to-night-clubs/1838031

https://www.channelnewsasia.com/news/asia/south-korea-covid-19-church-backlash-13092284

Bonus: Refine Dataset for Machine Learning

Will the model's accuracy improve if we include patient who died without any treatment?

In [102]:
patientData = pd.read_csv('covid/patientinfo.csv')
patientModellingData = patientData[['sex','age','province','confirmed_date','released_date','deceased_date','state']]
deadPatientModellingData = patientData[patientData['state'] == 'deceased']
totalDiedPatient = deadPatientModellingData.count()
display(totalDiedPatient)

display('***********')

pd.set_option('display.max_rows', deadPatientModellingData.shape[0]+1)
totalDiedPaitentWithoutTreatment = deadPatientModellingData[deadPatientModellingData['deceased_date'].isnull() == True].count()
display(totalDiedPaitentWithoutTreatment)
patient_id            78
sex                   75
age                   75
country               78
province              78
city                  59
infection_case        36
infected_by            3
contact_number         7
symptom_onset_date     6
confirmed_date        78
released_date          3
deceased_date         66
state                 78
dtype: int64
'***********'
patient_id            12
sex                    9
age                    9
country               12
province              12
city                  12
infection_case         8
infected_by            2
contact_number         4
symptom_onset_date     4
confirmed_date        12
released_date          1
deceased_date          0
state                 12
dtype: int64

Assuming patient who died without any released_date or deceased_date are those died without any treatment

There are about 12 patient who died without any treatment

In [103]:
str(12/78 * 100) + '%'
Out[103]:
'15.384615384615385%'

This accounts for 15% of the total death

Preparing the data for modelling

We will simply add another line of logic to return treatment days = 0
if the patient is deceased and both released and deceased date are empty

In addition, we will add in the month when the patient is confirmed to have covid cases

In [104]:
patientData = pd.read_csv('covid/patientinfo.csv')
patientModellingData = patientData[['sex','age','province','confirmed_date','released_date','deceased_date','state']]

## We removed isolated patient since it is not confirmed if they survived or not
patientModellingData = patientModellingData[patientModellingData['state'] != 'isolated']
nullList = ['sex','age','confirmed_date']
for item in nullList:
     patientModellingData = patientModellingData[~patientModellingData[item].isnull()]
        
display(patientModellingData.head())
sex age province confirmed_date released_date deceased_date state
0 male 50s Seoul 2020-01-23 2020-02-05 NaN released
1 male 30s Seoul 2020-01-30 2020-03-02 NaN released
2 male 50s Seoul 2020-01-30 2020-02-19 NaN released
3 male 20s Seoul 2020-01-30 2020-02-15 NaN released
4 female 20s Seoul 2020-01-31 2020-02-24 NaN released
In [105]:
cols = ['released_date', 'deceased_date', 'confirmed_date']
patientModellingData[cols] = patientModellingData[cols].apply(pd.to_datetime, errors='coerce', axis=1)

def calculate_number_of_treatment_days(row):
    if row["released_date"] is not pd.NaT:
        treatmentDays = pd.to_numeric((row['released_date'] - row['confirmed_date']).days)
        return(treatmentDays)
    elif row["deceased_date"] is not pd.NaT:
        treatmentDays = pd.to_numeric((row['deceased_date'] - row['confirmed_date']).days)
        return(treatmentDays)
    elif row["deceased_date"] is pd.NaT and row["released_date"] is pd.NaT:
        if row["state"] == 'deceased' and row["state"] != 'released':
            return 0
        else:
            return None
    else:
        return None

patientModellingData['Treatment Days'] = patientModellingData.apply(calculate_number_of_treatment_days, axis=1)

patientModellingData = patientModellingData[~patientModellingData['Treatment Days'].isnull()]
patientModellingDataTreatment = patientModellingData[['sex','age','confirmed_date','province','state','Treatment Days']]
patientModellingDataTreatment['confirmed_month'] = patientModellingDataTreatment['confirmed_date'].dt.month
patientModellingDataTreatment = patientModellingDataTreatment.drop('confirmed_date', axis=1)
display(patientModellingDataTreatment.head())
sex age province state Treatment Days confirmed_month
0 male 50s Seoul released 13.0 1
1 male 30s Seoul released 32.0 1
2 male 50s Seoul released 20.0 1
3 male 20s Seoul released 16.0 1
4 female 20s Seoul released 24.0 1
In [106]:
genders = {"male": 0, "female": 1}
patientModellingDataTreatment['sex'] = patientModellingDataTreatment['sex'].map(genders)

state = {"released": 0, "deceased": 1}
patientModellingDataTreatment['state'] = patientModellingDataTreatment['state'].map(state)

age = {'0s':0, '10s':1, '20s':2, '30s':3, '40s':4, '50s':5, '60s':6, '70s':7, '80s':8, '90s':9}
patientModellingDataTreatment['age'] = patientModellingDataTreatment['age'].map(age)

display(patientModellingDataTreatment.head())
sex age province state Treatment Days confirmed_month
0 0 5.0 Seoul 0 13.0 1
1 0 3.0 Seoul 0 32.0 1
2 0 5.0 Seoul 0 20.0 1
3 0 2.0 Seoul 0 16.0 1
4 1 2.0 Seoul 0 24.0 1
In [107]:
dataForModelling = patientModellingDataTreatment.join(provinceDummy)
dataForModelling = dataForModelling.drop('province',axis=1)
dataForModelling
Out[107]:
sex age state Treatment Days confirmed_month Chungcheongbuk-do Chungcheongnam-do Daegu Daejeon Gangwon-do ... Gyeonggi-do Gyeongsangbuk-do Gyeongsangnam-do Incheon Jeju-do Jeollabuk-do Jeollanam-do Sejong Seoul Ulsan
0 0 5.0 0 13.0 1 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
1 0 3.0 0 32.0 1 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
2 0 5.0 0 20.0 1 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
3 0 2.0 0 16.0 1 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
4 1 2.0 0 24.0 1 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5156 0 3.0 0 46.0 4 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
5157 1 2.0 0 32.0 4 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
5158 1 1.0 0 12.0 4 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
5159 1 3.0 0 34.0 5 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
5160 1 3.0 0 14.0 5 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0

1643 rows × 21 columns

Building Machine Learning Models Part 3

In [108]:
clean_dataset(dataForModelling)
display(dataForModelling.count())
print(np.any(np.isnan(dataForModelling)))
print(np.all(np.isfinite(dataForModelling)))

x3 = dataForModelling.drop(["state"], axis=1)
y3 = dataForModelling["state"]
X_train, X_test, Y_train, Y_test = train_test_split(x3, y3, test_size=0.33, random_state=42)

## Random_forest
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)

Y_prediction = random_forest.predict(X_test)

random_forest.score(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)

## Decision Tree
decision_tree = DecisionTreeClassifier() 
decision_tree.fit(X_train, Y_train)  
Y_pred = decision_tree.predict(X_test)  
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)

results = pd.DataFrame({
    'Model': ['Random Forest','Decision Tree'],
    'Score': [acc_random_forest, acc_decision_tree]})
result_df = results.sort_values(by='Score', ascending=False)
result_df = result_df.set_index('Score')
result_df.head(2)
sex                  1634
age                  1634
state                1634
Treatment Days       1634
confirmed_month      1634
Chungcheongbuk-do    1634
Chungcheongnam-do    1634
Daegu                1634
Daejeon              1634
Gangwon-do           1634
Gwangju              1634
Gyeonggi-do          1634
Gyeongsangbuk-do     1634
Gyeongsangnam-do     1634
Incheon              1634
Jeju-do              1634
Jeollabuk-do         1634
Jeollanam-do         1634
Sejong               1634
Seoul                1634
Ulsan                1634
dtype: int64
False
True
Out[108]:
Model
Score
99.73 Random Forest
99.73 Decision Tree

Importances

In [109]:
importances = pd.DataFrame({'feature':X_train.columns,'importance':np.round(random_forest.feature_importances_,3)})
importances3 = importances.sort_values('importance',ascending=False).set_index('feature')
importances3.head(15)
Out[109]:
importance
feature
Treatment Days 0.552
age 0.217
Daegu 0.112
confirmed_month 0.037
sex 0.020
Gyeongsangbuk-do 0.019
Seoul 0.010
Gangwon-do 0.009
Ulsan 0.006
Chungcheongnam-do 0.006
Gyeongsangnam-do 0.002
Incheon 0.002
Chungcheongbuk-do 0.002
Gyeonggi-do 0.002
Gwangju 0.001
In [110]:
barData = importances.reset_index()
fig = px.bar(barData, x='feature', y='importance')
fig.update_layout(hovermode='x')
fig.show()

Confusion Matrix with Precision & Recall & F-Score

In [111]:
predictions = cross_val_predict(random_forest, X_train, Y_train, cv=3)
result3_train = confusion_matrix(Y_train, predictions)
display(result3_train)

p3_train = precision_score(Y_train, predictions)
r3_train = recall_score(Y_train, predictions)
f3_train = f1_score(Y_train, predictions)
print("Precision:", p3_train)
print("Recall:", r3_train)
print("F-Score:",f3_train)
array([[1044,    5],
       [  17,   28]])
Precision: 0.8484848484848485
Recall: 0.6222222222222222
F-Score: 0.7179487179487178
In [112]:
predictions = cross_val_predict(random_forest, X_test, Y_test, cv=3)
result3_test = confusion_matrix(Y_test, predictions)
display(result3_test)

p3_test = precision_score(Y_test, predictions)
r3_test = recall_score(Y_test, predictions)
f3_test = f1_score(Y_test, predictions)
print("Precision:", p3_test)
print("Recall:", r3_test)
print("F-Score:", f3_test)
array([[515,   3],
       [  9,  13]])
Precision: 0.8125
Recall: 0.5909090909090909
F-Score: 0.6842105263157896

Precision Recall Curve

Precision/recall tradeoff

In [113]:
# getting the probabilities of our predictions
y_scores = random_forest.predict_proba(X_train)
y_scores = y_scores[:,1]

precision, recall, threshold = precision_recall_curve(Y_train, y_scores)
def plot_precision_and_recall(precision, recall, threshold):
    plt.plot(threshold, precision[:-1], "r-", label="precision", linewidth=5)
    plt.plot(threshold, recall[:-1], "b", label="recall", linewidth=5)
    plt.xlabel("threshold", fontsize=10)
    plt.legend(loc="upper right", fontsize=10)
    plt.ylim([0, 1])

plt.figure(figsize=(10, 5))
plot_precision_and_recall(precision, recall, threshold)
plt.show()
In [114]:
def plot_precision_vs_recall(precision, recall):
    plt.plot(recall, precision, "g--", linewidth=2.5)
    plt.ylabel("recall", fontsize=19)
    plt.xlabel("precision", fontsize=19)
    plt.axis([0, 1.5, 0, 1.5])

plt.figure(figsize=(10, 5))
plot_precision_vs_recall(precision, recall)

plt.show()

Final Summary

In [115]:
print('******************************')
print('Model without treatment day: ')
print('importance:', importances2.head(3))
display(result2_train)
display(result2_test)

print('Model with treatment day :')
print('importance:', importances1.head(3))
display(result1_train)
display(result1_test)

print('******************************')
print('Model with treatment day, those who died without treatment and infected month:')
print('importance:', importances3.head(3))
display(result3_train)
display(result3_test)

print('******************************')
print('Model without treatment day: ')
print("Train Data Precision:", p2_train)
print("Train Data Recall:", r2_train)
print("Train Data F-Score:", f2_train)
print("Test Data Precision:", p2_test)
print("Test Data Recall:", r2_test)
print("Test Data F-Score:", f2_test)

print('******************************')
print('Model with treatment day :')
print("Train Data Precision:", p1_train)
print("Train Data Recall:", r1_train)
print("Train Data F-Score:", f1_train)
print("Test Data Precision:", p1_test)
print("Test Data Recall:", r1_test)
print("Test Data F-Score:", f1_test)

print('******************************')
print('Model with treatment day, those who died without treatment and infected month:')
print("Train Data Precision:", p3_train)
print("Train Data Recall:", r3_train)
print("Train Data F-Score:", f3_train)
print("Test Data Precision:", p3_test)
print("Test Data Recall:", r3_test)
print("Test Data F-Score:", f3_test)
******************************
Model without treatment day: 
importance:          importance
feature            
age           0.430
Daegu         0.334
sex           0.082
array([[1047,    2],
       [  34,   11]])
array([[510,   8],
       [ 13,   9]])
Model with treatment day :
importance:                 importance
feature                   
Treatment Days       0.594
age                  0.207
Daegu                0.116
array([[1042,    7],
       [  15,   30]])
array([[514,   4],
       [  9,  13]])
******************************
Model with treatment day, those who died without treatment and infected month:
importance:                 importance
feature                   
Treatment Days       0.552
age                  0.217
Daegu                0.112
array([[1044,    5],
       [  17,   28]])
array([[515,   3],
       [  9,  13]])
******************************
Model without treatment day: 
Train Data Precision: 0.8461538461538461
Train Data Recall: 0.24444444444444444
Train Data F-Score: 0.37931034482758624
Test Data Precision: 0.5294117647058824
Test Data Recall: 0.4090909090909091
Test Data F-Score: 0.46153846153846156
******************************
Model with treatment day :
Train Data Precision: 0.8108108108108109
Train Data Recall: 0.6666666666666666
Train Data F-Score: 0.7317073170731707
Test Data Precision: 0.7647058823529411
Test Data Recall: 0.5909090909090909
Test Data F-Score: 0.6666666666666667
******************************
Model with treatment day, those who died without treatment and infected month:
Train Data Precision: 0.8484848484848485
Train Data Recall: 0.6222222222222222
Train Data F-Score: 0.7179487179487178
Test Data Precision: 0.8125
Test Data Recall: 0.5909090909090909
Test Data F-Score: 0.6842105263157896

References

Soh, Z. (2020). 02a.bank-marketing.ipynb [Scholarly project]. Retrieved 2020.

Datartist. (2020, July 13). Data Science for COVID-19 (DS4C). Retrieved September 16, 2020, from https://www.kaggle.com/kimjihoo/coronavirusdataset

Press, P. (2016, July 14). Abraham Wald and the Missing Bullet Holes. Retrieved September 16, 2020, from https://medium.com/@penguinpress/an-excerpt-from-how-not-to-be-wrong-by-jordan-ellenberg-664e708cfc3d

Anand. (2019, March 11). Network Graph with AT&T data using Plotly. Retrieved September 16, 2020, from https://medium.com/@anand0427/network-graph-with-at-t-data-using-plotly-a319f9898a02

Donges, N. (2018, May 15). Predicting the Survival of Titanic Passengers. Retrieved September 16, 2020, from https://towardsdatascience.com/predicting-the-survival-of-titanic-passengers-30870ccc7e8

Rostami, D. (2020, March 22). Coronavirus Time Series Map Animation. Retrieved September 16, 2020, from https://datacrayon.com/posts/statistics/data-is-beautiful/coronavirus-time-series-map-animation/

Appendix

Case.csv

  • case_id
  • province
  • city
  • group
  • infection_case
  • confirmed
  • latitude
  • longitude

PatientInfo.csv

  • patient_id
  • sex
  • age
  • country
  • province
  • city
  • infection_case
  • infected_by
  • contact_number
  • symptom_onset_date
  • confirmed_date
  • released_date
  • deceased_date
  • state

Policy.csv

  • policy_id
  • country
  • type
  • gov_policy
  • detail
  • start_date
  • end_date

Region.csv

  • code
  • province
  • city
  • latitude
  • longitude
  • elementary_school_count
  • kindergarten_count
  • university_count
  • academy_ratio
  • elderly_population_ratio
  • elderly_alone_ratio
  • nursing_home_count

Time.csv

  • date
  • time
  • test
  • negative
  • confirmed
  • released
  • deceased

TimeAge.csv

  • date
  • time
  • age
  • confirmed
  • deceased

TimeGender.csv

  • date
  • time
  • sex
  • confirmed
  • deceased

TimeProvince.csv

  • date
  • time
  • province
  • confirmed
  • released
  • deceased

Weather.csv

  • code
  • province
  • date
  • avg_temp
  • min_temp
  • max_temp
  • precipitation
  • max_wind_speed
  • most_wind_direction
  • avg_relative_humidity

Contribution Statements

You Yu Quan

Tan Zhi Yang

Yuan Yong Xi (Yannis)

Yap Chee Ann (Victor)

In [ ]: